Abstract


The Direct Access File System (DAFS) is an emerging
industrial standard for network-attached storage.
DAFS takes advantage of new user-level network
interface standards. This enables a user-level
file system structure in which client-side function- 
ality for remote data access resides in a library 
rather than in the kernel. This structure addresses 
longstanding performance problems stemming from 
weak integration of buffering layers in the network 
transport, kernel-based file systems and applica- 
tions. The benefits of this architecture include 
lightweight, portable and asynchronous access to 
network storage and improved application control 
over data movement, caching and prefetching. 


This paper explores the fundamental perfor- 
mance characteristics of a user-level file system 
structure based on DAFS. It presents experimental
results from an open-source DAFS prototype and 
compares its performance to a kernel-based NFS 
implementation optimized for zero-copy data trans- 
fer. The results show that both systems can deliver 
file access throughput in excess of 100 MB/s,
saturating network links with similar raw bandwidth.
Lower client overhead in the DAFS configuration 
can improve application performance by up to 40% 
over optimized NFS when application processing 
and I/O demands are well-balanced. 


1 Introduction 


The performance of high-speed network stor- 
age systems is often limited by client overhead, 
such as memory copying, network access costs and
protocol overhead [2, 8, 20, 29]. A related source 
of inefficiency stems from poor integration of
applications and file system services; lack of
control over kernel policies leads to problems such as
double caching, false prefetching and poor concurrency
management [34]. As a result, databases and
other performance-critical applications often bypass 
file systems in favor of raw block storage access.
This sacrifices the benefits of the file system model, 
including ease of administration and safe sharing
of resources and data. These problems have also 
motivated the design of radical operating system 
structures to allow application control over resource 
management [21, 31]. 

The recent emergence of commercial direct- 
access transport networks creates an opportunity to 
address these issues without changing operating sys- 
tems in common use. These networks incorporate 
two defining features: user-level networking and re- 
mote direct memory access (RDMA). User-level net- 
working allows safe network communication directly 
from user-mode applications, removing the kernel
from the critical I/O path. RDMA allows the net- 
work adapter to reduce copy overhead by accessing 
application buffers directly. 

The Direct Access File System (DAFS) [14] is 
a new standard for network-attached storage over 
direct-access transport networks. The DAFS proto- 
col is based on the Network File System Version 4 
protocol [32], with added protocol features for direct 
data transfer using RDMA, scatter/gather list I/O, 
reliable locking, command flow-control and session
recovery. DAFS is designed to enable a user-level 
file system client: a DAFS client may run as an
application library above the operating system kernel,
with the kernel’s role limited to basic network device 
support and memory management. This structure 
can improve performance, portability and reliability,
and offer applications fully asynchronous I/O
and more direct control over data movement and 
caching. Network Appliance and other network- 
attached storage vendors are planning DAFS inter- 
faces for their products. 

This paper explores the fundamental structural
and performance characteristics of network file
access using a user-level file system structure on a 
direct-access transport network with RDMA. We 
use DAFS as a basis for exploring these features 
since it is the first fully-specified file system pro- 
tocol to support them. We describe DAFS-based 
client and server reference implementations for an
open-source Unix system (FreeBSD) and report
experimental results, comparing DAFS to a zero-copy
NFS implementation. Our purpose is to illustrate 
the benefits and tradeoffs of these techniques to
provide a basis for informed choices about deployment
of DAFS-based systems and similar extensions to 
other network file protocols, such as NFS. 


Our experiments explore the application properties
that determine how RDMA and user-level file
systems affect performance. For example, when a 
workload is balanced (i.e., the application simulta- 
neously saturates the CPU and network link) DAFS 
delivers the most benefit compared to more tradi- 
tional architectures. When workloads are limited by 
the disk, DAFS and more traditional network file 
systems behave comparably. Other workload fac- 
tors such as metadata-intensity, I/O sizes, file sizes, 
and I/O access pattern also influence performance. 


An important property of the user-level file 
system structure is that applications are no longer 
bound by the kernel’s policies for file system buffer- 
ing, caching and prefetching. The user-level file 
system structure and the DAFS API allow applications
full control over file system access; however,
the application can no longer benefit from shared 
kernel facilities for caching and prefetching. A sec- 
ondary goal of our work is to show how adaptation 
libraries for specific classes of applications enable 
those applications to benefit from improved control 
and tighter integration with the file system, while 
reducing or eliminating the burden on application 
developers. We present experiments with two adap- 
tation libraries for DAFS clients: Berkeley DB [28] 
and the TPIE external memory I/O toolkit [37]. 
These adaptation libraries provide the benefits of 
the user-level file system without requiring that ap- 
plications be modified to use the DAFS API. 


The layout of this paper is as follows. Section 2 
summarizes the trends that motivated DAFS and 
user-level file systems and sets our study in context 
with previous work. Section 3 gives an overview of 
the salient features of the DAFS specifications, and 
Section 4 describes the DAFS reference implemen- 
tation used in the experiments. Section 5 presents 
two example adaptation libraries, and Section 6
describes zero-copy, kernel-based NFS as an alternative
to DAFS. Section 7 presents experimental results.
We conclude in Section 8.


2 Background and Related Work 


In this section, we discuss the previous work 
that lays the foundation for DAFS and provides the 
context for our experimental results. We begin with 
a discussion of the issues that limit performance in 
network storage systems and then discuss the two 
critical architectural features that we examine to at- 
tack performance bottlenecks: direct-access trans- 
ports and user-level file systems. 


2.1 Network Storage Performance 


Network storage solutions can be categorized
as Storage-Area Network (SAN)-based solutions,
which provide a block abstraction to clients, and 
Network-Attached Storage (NAS)-based solutions, 
which export a network file system interface. Be- 
cause a SAN storage volume appears as a local disk, 
the client has full control over the volume’s data 
layout; client-side file systems or database software 
can run unmodified [23]. However, this precludes 
concurrent access to the shared volume from other 
clients, unless the client software is extended to
coordinate its accesses with other clients [36]. In
contrast, a NAS-based file service can control sharing
and access for individual files on a shared volume. 
This approach allows safe data sharing across di- 
verse clients and applications. 


Communication overhead was a key factor 
driving acceptance of Fibre Channel [20] as a high- 
performance SAN. Fibre Channel leverages network 
interface controller (NIC) support to offload trans- 
port processing from the host and access I/O blocks 
in host memory directly without copying. Recently, 
NICs supporting the emerging iSCSI block storage 
standard have entered the market as an IP-based 
SAN alternative. In contrast, NAS solutions have 
typically used IP-based protocols over conventional 
NICs, and have paid a performance penalty. The 
most-often cited causes for poor performance of net- 
work file systems are (a) protocol processing in net- 
work stacks; (b) memory copies [2, 15, 29, 35]; and 
(c) other kernel overhead such as system calls and 
context switches. Data copying, in particular, in- 
curs substantial per-byte overhead in the CPU and 
memory system that is not masked by advancing
processor technology. 


One way to reduce network storage access overhead
is to offload some or all of the transport protocol
processing to the NIC. Many network adapters
can compute Internet checksums as data moves to 
and from host memory; this approach is relatively 
simple and delivers a substantial benefit. An in- 
creasing number of adapters can offload all TCP 
or UDP protocol processing, but more substantial 
kernel revisions are needed to use them. Neither ap- 
proach by itself avoids the fundamental overheads 
of data copying.

Several known techniques can remove copies 
from the transport data path. Previous work has 
explored copy avoidance for TCP/IP communication
(Chase et al. [8] provide a summary). Brustoloni [5]
introduced emulated copy, a scheme that
avoids copying in network I/O while preserving copy
semantics. IO-Lite [29] adds scatter/gather features
to the I/O API and relies on support from
the NIC to handle multiple client processes safely 
without copying. Another approach is to implement 
critical applications (e.g., Web servers) in the ker- 
nel [19]. Some of the advantages can be obtained 
more cleanly with combined data movement prim- 
itives, e.g., sendfile, which move data from stor- 
age directly to a network connection without a user 
space transfer; this is useful for file transfer in com- 
mon server applications. 


DAFS was introduced to combine the low over- 
head and flexibility of SAN products with the gen- 
erality of NAS file services. The DAFS approach to 
removing these overheads is to use a direct-access 
transport to read and write application buffers di- 
rectly. DAFS also enables implementation of the file 
system client at user level for improved efficiency, 
portability and application control. The next two 
sections discuss these aspects of DAFS in more de- 
tail. In Section 6, we discuss an alternative ap- 
proach that reduces NFS overhead by eliminating 
data copying. 


2.2 Direct-Access Transports 


Direct-access transports are characterized by 
NIC support for remote direct memory access 
(RDMA), user-level networking with minimal ker- 
nel overhead, reliable messaging transport connec- 
tions and per-connection buffering, and efficient 
asynchronous event notification. The Virtual Inter- 
face (VI) Architecture [12] defines a host interface 
and API for NICs supporting these features. 


Direct-access transports enable user-level networking
in which the user-mode process interacts
directly with the NIC to send or receive messages
with minimal intervention from the operating system
kernel. The NIC device exposes an array of
connection descriptors to the system physical address
space. At connection setup time, the kernel
network driver maps a free connection descriptor
into the user process virtual address space,
giving the process direct and safe access to NIC
control registers and buffer queues in the descriptor.
This enables RDMA, which allows the network
adapter to reduce copy overhead by accessing
application buffers directly. The combination of
user-level network access and copy avoidance has a
lengthy heritage in research systems spanning two
decades [2, 4, 6, 33, 38].

Figure 1: User-level vs. kernel-based client file
system structure.

The experiments in Section 7 quantify the im- 
provement in access overhead that DAFS gains from 
RDMA and transport offload on direct-access NICs. 


2.3 User-Level File Systems


In addition to overhead reduction, the DAFS 
protocol leverages user-level networking to enable 
the network file system structure depicted in the 
left-hand side of Figure 1. In contrast to traditional 
kernel-based network file system implementations, 
as shown in the right side of Figure 1, DAFS file 
clients may run in user mode as libraries linked di- 
rectly with applications. 


While DAFS also supports kernel-based
clients, our work focuses primarily on the properties
of the user-level file system structure. A user-level
client yields additional modest reductions in overhead
by removing system call costs. Perhaps more
importantly, it can run on any operating system, 
with no special kernel support needed other than 
the NIC driver itself. The client may evolve in- 
dependently of the operating system, and multiple 
client implementations may run on the same system.
Most importantly, this structure offers an opportunity
to improve integration of file system functions
with I/O-intensive applications. In particular,
it enables fully asynchronous pipelined file system
access, even on systems with inadequate kernel
support for asynchronous I/O, and it offers full ap- 
plication control over caching, data movement and 
prefetching. 


It has long been recognized that the kernel 
policies for file system caching and prefetching are 
poorly matched to the needs of some important ap- 
plications [34]. Migrating these OS functions into 
libraries to allow improved application control and 
specialization is similar in spirit to the library
operating systems of Exokernel [21], protocol service
decomposition for high-speed networking [24], and
related approaches. User-level file systems were
conceived for the SHRIMP project [4] and the
Network-Attached Secure Disks (NASD) project [18]. NFS
and other network file system protocols could sup- 
port user-level clients over an RPC layer incorpo- 
rating the relevant features of DAFS [7], and we 
believe that our results and conclusions would ap- 
ply to such a system. 


Earlier work arguing against user-level file sys- 
tems [39] assumed some form of kernel mediation 
in the critical I/O path and did not take into ac- 
count the primary sources of overhead outlined in 
Section 2.1. However, the user-level structure con- 
sidered in this paper does have potential disadvan- 
tages. It depends on direct-access network hard- 
ware, which is not yet widely deployed. Although 
an application can control caching and prefetching, 
it does not benefit from the common policies for 
shared caching and prefetching in the kernel. Thus, 
in its simplest form, this structure places more bur- 
den on the application to manage data movement, 
and it may be necessary to extend applications to 
use a new file system API. Section 5 shows how 
this power and complexity can be encapsulated in 
prepackaged I/O adaptation libraries (depicted in 
Figure 1) implementing APIs and policies appro- 
priate for a particular class of applications. If the 
adaptation API has the same syntax and seman- 
tics as a pre-existing API, then it is unnecessary to 
modify the applications themselves (or the operat- 
ing system). 


3 DAFS Architecture and Standards


The DAFS specification grew out of the DAFS 
Collaborative, an industry/academic consortium 
led by Network Appliance and Intel, and it is
presently undergoing standardization through the 
Storage Networking Industry Association (SNIA). 


The draft standard defines the DAFS proto- 
col [14] as a set of request and response formats 
and their semantics, and a recommended procedural
DAFS API [13] to access the DAFS service
from a client program. Because library-level com- 
ponents may be replaced, client programs may ac- 
cess a DAFS service through any convenient I/O 
interface. The DAFS API is specified as a recom- 
mended interface to promote portability of DAFS 
client programs. The DAFS API is richer and more 
complex than common file system APIs including 
the standard Unix system call interface. 


The following subsections give an overview of the
DAFS architecture and standards, with an empha- 
sis on the transport-related aspects: Sections 3.2 
and 3.3 focus on DAFS support for RDMA and 
asynchronous file I/O respectively. 


3.1 DAFS Protocol Summary 


The DAFS protocol derives from NFS Version 
4 [32] (NFSv4) but diverges from it in several sig- 
nificant ways. DAFS assumes a reliable network 
transport and offers server-directed command flow- 
control in a manner similar to block storage pro- 
tocols such as iSCSI. In contrast to NFSv4, every 
DAFS operation is a separate request, but DAFS 
supports request chaining to allow pipelining of de- 
pendent requests (e.g., a name lookup or open fol- 
lowed by file read). DAFS protocol headers are or- 
ganized to preserve alignment of fixed-size fields. 
DAFS also defines features for reliable session re- 
covery and enhanced locking primitives. To enable 
the application (or an adaptation layer) to support
file caching, DAFS adopts the NFSv4 mechanism
for consistent caching based on open delegations
[1, 14, 32].

The DAFS specification is independent of the 
underlying transport, but its features depend on 
direct-access NICs. In addition, some transport-level
features (e.g., message flow-control) are defined
within the DAFS protocol itself, although they
could be viewed as a separate layer below the file 
service protocol. 


3.2 Direct-Access Data Transfer


To benefit from RDMA, DAFS supports direct
variants of key data transfer operations (read, write,
readdir, getattr, setattr). Direct operations transfer
directly to or from client-provided memory regions
using RDMA read or write operations as described
in Section 2.2.


The client must register each memory region 
with the local kernel before requesting direct I/O on 
the region. The DAFS API defines primitives to reg- 
ister and unregister memory regions for direct I/O; 
the register primitive returns a region descriptor to 
designate the region for direct I/O operations. In 
current implementations, registration issues a sys- 
tem call to pin buffer regions in physical memory, 
then loads page translations for the region into a 
lookup table on the NIC so that it may interpret 
incoming RDMA directives. To control buffer pin- 
ning by a process for direct I/O, the operating sys- 
tem should impose a resource limit similar to that 
applied in the case of the 4.4BSD mlock API [26]. 
Buffer registration may be encapsulated in an adap- 
tation library. 
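
As a rough illustration of the resource limit described above, the following Python sketch charges each registration against a per-process pin budget, in the spirit of an mlock-style limit. The class and method names are hypothetical and are not part of the DAFS API; the 4 KB page size and the assumption that regions start on a page boundary are also ours.

```python
class PinAccounting:
    """Tracks bytes pinned by one process against a fixed limit.

    Hypothetical helper for illustration only; not a DAFS API.
    """

    def __init__(self, limit_bytes, page_size=4096):
        self.limit = limit_bytes
        self.page_size = page_size
        self.pinned = 0
        self._regions = {}      # descriptor -> bytes charged
        self._next_desc = 1

    def register(self, length):
        # Pinning works on whole pages, so round the charge up.
        # (Assumes the region starts on a page boundary.)
        pages = -(-length // self.page_size)
        charge = pages * self.page_size
        if self.pinned + charge > self.limit:
            raise MemoryError("pin limit exceeded")
        desc = self._next_desc
        self._next_desc += 1
        self._regions[desc] = charge
        self.pinned += charge
        return desc

    def unregister(self, desc):
        # Release the charge so later registrations can succeed.
        self.pinned -= self._regions.pop(desc)
```

An adaptation library could keep one such object per process and fall back to non-direct I/O when a registration is refused.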


RDMA operations for direct I/O in the DAFS 
protocol are always initiated by the server rather 
than a client. For example, to request a DAFS di- 
rect write, the client’s write request to the server 
includes a region token for the buffer containing the 
data. The server then issues an RDMA read to fetch
the data from the client, and responds to the DAFS 
write request after the RDMA completes. This al- 
lows the server to manage its buffers and control 
the order and rate of data transfer [27]. 
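
The direct write sequence described above can be modeled in a few lines of Python. This is a toy simulation under stated assumptions, not the DAFS wire protocol: a bytearray stands in for pinned client memory, a dictionary stands in for the NIC translation table, and all class and method names are invented for illustration.

```python
class ClientNic:
    """Holds registered regions; the server pulls data through it."""

    def __init__(self):
        self._regions = {}
        self._next_token = 1

    def register(self, buf):
        # Real registration would pin pages and load translations on the NIC.
        token = self._next_token
        self._next_token += 1
        self._regions[token] = buf
        return token

    def rdma_read(self, token, offset, length):
        # Performed on behalf of the remote side; the client CPU is not involved.
        region = self._regions[token]
        if offset < 0 or offset + length > len(region):
            raise ValueError("RDMA access outside registered region")
        return bytes(region[offset:offset + length])


class Server:
    def __init__(self):
        self.files = {}

    def direct_write(self, nic, name, file_offset, token, buf_offset, length):
        # The request carries only a region token. The server fetches the
        # payload when it is ready, then replies to the write request.
        data = nic.rdma_read(token, buf_offset, length)
        contents = self.files.setdefault(name, bytearray())
        end = file_offset + length
        if len(contents) < end:
            contents.extend(b"\0" * (end - len(contents)))
        contents[file_offset:end] = data
        return length   # reply is sent only after the RDMA completes
```

Because the server decides when to call `rdma_read`, it can delay or reorder transfers to match its own buffer availability, which is the property the text attributes to server-initiated RDMA.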


3.3 Asynchronous I/O and Prefetching


The DAFS API supports a fully asynchronous 
interface, enabling clients to pipeline I/O operations 
and overlap them with application processing. A 
flexible event notification mechanism delivers asyn- 
chronous I/O completions: the client may create an 
arbitrary number of completion groups, specify an 
arbitrary completion group for each DAFS opera- 
tion and poll or wait for events on any completion 
group. 
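
To make the completion-group idea concrete, here is a minimal Python sketch. The names (`CompletionGroup`, `ToyAsyncClient`, `read_async`, `progress`) are hypothetical and do not correspond to the DAFS API; `progress` simply stands in for responses arriving from a server.

```python
from collections import deque


class CompletionGroup:
    """Queue of finished operations that a client can poll."""

    def __init__(self):
        self._done = deque()

    def post(self, op_id, result):
        self._done.append((op_id, result))

    def poll(self):
        # Non-blocking: return one completion, or None if none is ready.
        return self._done.popleft() if self._done else None


class ToyAsyncClient:
    def __init__(self):
        self._in_flight = []
        self._next_id = 0

    def read_async(self, data, offset, length, group):
        # Issue a read and name the group that should receive its completion.
        op_id = self._next_id
        self._next_id += 1
        self._in_flight.append((op_id, data, offset, length, group))
        return op_id

    def progress(self):
        # Stand-in for server responses: complete everything in flight.
        for op_id, data, offset, length, group in self._in_flight:
            group.post(op_id, data[offset:offset + length])
        self._in_flight.clear()
```

Separate groups let an application keep, say, prefetch completions apart from demand reads and poll each at its own pace.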

The asynchronous I/O primitives enable event- 
driven application architectures as an alternative 
to multithreading. Event-driven application struc- 
tures are often more efficient and more portable 
than those based on threads. Asynchronous I/O 
APIs allow better application control over concur- 
rency, often with lower overhead than synchronous 
I/O using threads. 


Many NFS implementations support a limited 
form of asynchrony beneath synchronous kernel I/O 
APIs. Typically, multiple processes (called I/O daemons
or nfsiods) issue blocking requests for sequential
block read-ahead or write-behind. Unfortunately,
frequent nfsiod context switching adds overhead [2].
The kernel policies only prefetch after
a run of sequential reads and may prefetch erroneously
if future reads are not sequential.
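
The behavior of such a sequential read-ahead policy can be sketched as follows. This is a simplified model for illustration, not the FreeBSD heuristic; the threshold and window parameters are assumptions.

```python
class SequentialReadAhead:
    """Prefetch only after a run of consecutive block reads."""

    def __init__(self, run_threshold=2, window=4):
        self.run_threshold = run_threshold  # sequential steps needed to trigger
        self.window = window                # blocks to prefetch once triggered
        self.last_block = None
        self.run_length = 0

    def on_read(self, block):
        """Return the block numbers to prefetch after this read."""
        if self.last_block is not None and block == self.last_block + 1:
            self.run_length += 1
        else:
            self.run_length = 0   # any non-sequential read resets the run
        self.last_block = block
        if self.run_length >= self.run_threshold:
            return list(range(block + 1, block + 1 + self.window))
        return []
```

If an application reads blocks 0, 1, 2 and then jumps elsewhere, the blocks prefetched after block 2 are wasted work, which is the false prefetching problem that application-controlled prefetching avoids.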


4 DAFS Reference Implementation 


We have built prototypes of a user-level DAFS 
client and a kernel DAFS server implementation for 
FreeBSD. Both sides of the reference implementa- 
tion use protocol stubs in a DAFS SDK provided 
by Network Appliance. The reference implementa- 
tion currently uses a 1.25 Gb/s Giganet cLAN VI 
interconnect. 


4.1 User-level Client 


The user-level DAFS client is based on a three- 
module design, separating transport functions, flow- 
control and protocol handling. It implements 
an asynchronous event-driven control core for the 
DAFS request/response channel protocol. The sub- 
set of the DAFS API supported includes direct and 
asynchronous variants of basic file access and data 
transfer operations. 


The client design allows full asynchrony for 
single-threaded applications. All requests to the 
library are non-blocking, unless the caller explic- 
itly requests to wait for a pending completion. The 
client polls for event completion in the context of 
application threads, in explicit polling requests and 
in a standard preamble/epilogue executed on ev- 
ery entry and exit to the library. At these points, 
it checks for received responses and may also initi- 
ate pending sends if permitted by the request flow- 
control window. Each thread entry into the library 
advances the work of the client. One drawback of 
this structure is that pending completions build up 
on the client receive queues if the application does 
not enter the library. However, deferring response 
processing in this case does not interfere with the 
activity of the client, since it is not collecting its 
completions or initiating new I/O. A more general 
approach was recently proposed for asynchronous 
application-level networking in Exokernel [16]. 
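
The entry/exit polling structure described in this section can be approximated by the sketch below. It is a toy model with invented names, not the reference client: every public call runs the same `_advance` step, which reaps received responses and then sends queued requests as far as the flow-control window allows.

```python
from collections import deque


class ToyClientCore:
    def __init__(self, window):
        self.window = window        # max requests outstanding at the server
        self.pending = deque()      # queued, not yet sent
        self.outstanding = set()    # sent, awaiting a response
        self.rx_queue = deque()     # responses deposited by the "NIC"
        self.completed = []

    def _advance(self):
        # Preamble/epilogue work: reap responses first, then send
        # whatever the flow-control window now permits.
        while self.rx_queue:
            req = self.rx_queue.popleft()
            self.outstanding.discard(req)
            self.completed.append(req)
        while self.pending and len(self.outstanding) < self.window:
            self.outstanding.add(self.pending.popleft())

    def submit(self, req):
        self._advance()
        self.pending.append(req)
        self._advance()

    def poll(self):
        self._advance()
        done, self.completed = self.completed, []
        return done
```

Nothing moves unless the application enters the library, so responses accumulate in `rx_queue` between calls, matching the drawback noted above.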


4.2 Kernel Server 


The kernel-based DAFS server [25] is a kernel- 
loadable module for FreeBSD 4.3-RELEASE that 
implements the complete DAFS specification. Us- 
ing the VFS/Vnode interface, the server may export 
any local file system through DAFS. The kernel-based
server also has an event-driven design and
takes advantage of efficient hardware support for 
event notification and delivery in asynchronous net- 
work I/O. The server is multithreaded in order to 
deal with blocking conditions in disk I/O and buffer 
cache locking. 


5 Adaptation Libraries 


Adaptation libraries are user-level I/O li- 
braries that implement high-level abstractions, 
caching and prefetching, and insulate applications 
from the complexity of handling DAFS I/O. Adap- 
tation libraries interpose between the application 
and the file system interface (e.g., the DAFS API). 
By introducing versions of the library for each file 
system API, applications written for the library I/O 
API can run over user-level or kernel-based file sys- 
tems, as depicted in Figure 2. 

DAFS-based adaptation libraries offer an op- 
portunity to specialize file system functions for 
classes of applications. For example, file caching 
at the application level offers three potential benefits.
First, the application can access a user-level
cache with lower overhead than a kernel-based
cache accessed through the system call interface. 
Second, the client can efficiently use application- 
specific fetch and replacement policies. Third, in 
cases where caching is an essential function of the 
adaptation library, a user-level file system avoids 
the problem of double caching, in which data is 
cached redundantly in the kernel-level and user- 
level caches. 
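
A user-level cache with a replaceable eviction policy might look like the following Python sketch. The class and its parameters are hypothetical and are not taken from TPIE or Berkeley DB; `fetch` stands in for a read through the file system client.

```python
from collections import OrderedDict


class UserLevelCache:
    """Block cache whose replacement policy is supplied by the application."""

    def __init__(self, capacity, fetch, pick_victim=None):
        self.capacity = capacity
        self.fetch = fetch
        # Default policy: evict the least recently used block.
        self.pick_victim = pick_victim or (lambda blocks: next(iter(blocks)))
        self.blocks = OrderedDict()   # ordered from least to most recent
        self.misses = 0

    def get(self, block_no):
        if block_no in self.blocks:
            self.blocks.move_to_end(block_no)   # hit: no system call needed
            return self.blocks[block_no]
        self.misses += 1
        if len(self.blocks) >= self.capacity:
            del self.blocks[self.pick_victim(self.blocks)]
        data = self.fetch(block_no)
        self.blocks[block_no] = data
        return data
```

Passing `pick_victim=lambda blocks: next(reversed(blocks))` switches the cache to most-recently-used eviction, which can suit large sequential scans better than LRU.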

One problem with user-level caching is that 
the kernel virtual memory system may evict cached 
pages if the client cache consumes more memory
than the kernel allocates to it. For this reason, each 
of these adaptation libraries either pre-registers its 
cache (as described in Section 3.2) or configures it 
to a “safe” size. A second problem is that user-level 
caches are not easily shared across multiple
uncooperating applications, but the need for this sharing
is less common in high-performance domains. 

To illustrate the role of adaptation libraries,
we consider two examples that we enhanced for use
with DAFS: TPIE and Berkeley DB.


5.1 TPIE


TPIE [37] (Transparent Parallel I/O Environment)
is a toolkit for external memory (EM) algorithms.
EM algorithms are structured to handle
massive data problems efficiently by minimizing the
number of I/O operations. They enhance locality of
data accesses by performing I/O in large blocks and
maximizing the useful computation on each block
while it is in memory. The TPIE toolkit supports
a range of applications including Geographic Information
System (GIS) analysis programs for massive
terrain grids [3].

Figure 2: Adaptation Libraries can benefit from
user-level clients without modifying applications.

To support EM algorithms, TPIE implements 
a dataflow-like streaming paradigm on collections 
of fixed-size records. It provides abstract stream 
types with high-level interfaces to “push” streams of 
data records through application-defined record op- 
erators. A pluggable Block Transfer Engine (BTE) 
manages transfer buffering and provides an inter- 
face to the underlying storage system. We intro- 
duced a new BTE for the DAFS interface, allowing 
us to run TPIE applications over DAFS without 
modification. The BTE does read-ahead and write-behind
on data streams using DAFS asynchronous
primitives, and handles the details of block cluster- 
ing and memory registration for direct I/O. 


5.2 Berkeley DB 


Berkeley DB (db) [28] is an open-source embedded
database system that provides library support
for concurrent storage and retrieval of key/value 
pairs. Db manages its own buffering and caching, in- 
dependent of any caching at the underlying file sys- 
tem buffer cache. Db can be configured to use a spe- 
cific page size (the unit of locking and I/O, usually 
8KB) and buffer pool size. Running db over DAFS
avoids double caching and bypasses the standard file 
system prefetching heuristics, which may degrade 
performance for common db access patterns. Sec- 
tion 7.4 shows the importance of these effects for db 
performance.


6 Low-Overhead Kernel-Based NFS


The DAFS approach is one of several alterna- 
tives to improving access performance for network 
storage. In this section we consider a prominent 
competing structure as a basis for the empirical 
comparisons in Section 7. This approach enhances 
a kernel-based NFS client to reduce overhead for
protocol processing and/or data movement. While 
this does not reduce system-call costs, there is no 
need to modify existing applications or even to
relink them if the kernel API is preserved. However,
it does require new kernel support, which is a bar- 
rier to fast and reliable deployment. Like DAFS, 
meaningful NFS enhancements of this sort also rely 
on new support in the NIC. 


The most general form of copy avoidance for 
file services uses header splitting and page flipping, 
variants of which have been used with TCP/IP protocols
for more than a decade (e.g., [5, 8, 11, 35]).
To illustrate, we briefly describe FreeBSD enhancements
to extend copy avoidance to read and write
operations in NFS. Most NFS implementations send
data directly from the kernel file cache without making
a copy, so a client initiating a write and a server
responding to a read can avoid copies. We focus on
the case of a client receiving a read response containing
a block of data to be placed in the file cache. The
key challenge is to arrange for the NIC to deposit 
the data payload—the file block—page-aligned in 
one or more physical page frames. These pages may 
then be inserted into the file cache by reference, 
rather than by copying. It is then straightforward 
to deliver the data to a user process by remapping 
pages rather than by physical copy, but only if the
user’s buffers are page-grained and suitably aligned.
This also assumes that the file system block size is 
an integral multiple of the page size. 
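
The conditions under which a read can be satisfied by remapping rather than copying can be written as a small predicate. This Python sketch only restates the constraints in the paragraph above; the function name and the 4 KB page size are assumptions.

```python
PAGE_SIZE = 4096  # assumed hardware page size


def can_remap(user_addr, length, fs_block_size, page_size=PAGE_SIZE):
    """True if a read could be delivered by page remapping instead of a copy."""
    return (
        fs_block_size % page_size == 0   # block size is a multiple of the page size
        and user_addr % page_size == 0   # user buffer starts on a page boundary
        and length > 0
        and length % page_size == 0      # user buffer is page-grained
    )
```

When the predicate is false, the kernel has to fall back to a physical copy into the user buffer.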

To do this, the NIC first strips off any trans- 
port headers and the NFS header from each mes- 
sage and places the data in a separate page-aligned 
buffer (header splitting). Note that if the network
MTU is smaller than the hardware page size, then 
the transfer of a page of data is spread across multi- 
ple packets, which can arrive at the receiver out-of- 
order and/or interspersed with packets from other 
flows. In order to pack the data contiguously into
pages, the NIC must do significant protocol process- 
ing for NFS and its transport to decode the incom- 
ing packets. NFS complicates this processing with 
variable-length headers. 


We modified the firmware for Alteon Tigon- 
II Gigabit Ethernet adapters to perform header 
splitting for NFS read response messages. This 
is sufficient to implement a zero-copy NFS client.
Our modifications apply only when the transport is 
UDP/IP and the network is configured for Jumbo 
Frames, which allow NFS to exchange data in units 
of pages. To allow larger block sizes, we altered IP 
fragmentation code in the kernel to avoid splitting 
page buffers across fragments of large UDP packets. 
Together with other associated kernel support in 
the file cache and VM system, this allows zero-copy 
data exchange with NFS block-transfer sizes up to 
32KB. Large NFS transfer sizes can reduce over- 
head for bulk data transfer by limiting the number 
of trips through the NFS protocol stack; this also 
reduces transport overheads on networks that allow 
large packets. 
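
A back-of-the-envelope calculation shows why larger transfer sizes help. The Python sketch below counts NFS requests for a given transfer and the number of IP fragments needed when page buffers are never split across fragments. It ignores UDP, RPC and NFS header bytes and assumes a 20-byte IP header and 4 KB pages; these simplifications are ours.

```python
import math


def nfs_requests(total_bytes, block_size):
    """Trips through the NFS protocol stack needed to move total_bytes."""
    return math.ceil(total_bytes / block_size)


def page_fragments(block_size, mtu=9000, page_size=4096, ip_header=20):
    """Fragments needed if whole pages are never split across fragments."""
    pages = math.ceil(block_size / page_size)
    pages_per_fragment = (mtu - ip_header) // page_size
    if pages_per_fragment == 0:
        raise ValueError("MTU too small to carry a whole page")
    return math.ceil(pages / pages_per_fragment)
```

Under these assumptions a 1 MB transfer takes 256 requests at 4 KB but only 32 at 32 KB, and a 1500-byte MTU cannot carry a whole page at all, which is consistent with the reliance on Jumbo Frames.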

While this is not a general solution, it allows 
us to assess the performance potential of optimiz- 
ing a kernel-based file system rather than adopt- 
ing a direct-access user-level file system architecture 
like DAFS. It also approximates the performance 
achievable with a kernel-based DAFS client, or an 
NFS implementation over VI or some other RDMA- 
capable network interface. As a practical matter, 
the RDMA approach embraced in DAFS is a more 
promising alternative to low-overhead NFS. Note 
that page flipping NFS is much more difficult over 
TCP, because the NIC must buffer and reassemble 
the TCP stream to locate NFS headers appearing 
at arbitrary offsets in the stream. This is possible in 
NICs implementing a TCP offload engine, but
impractical in conventional NICs such as the Tigon-II.


7 Experimental Results 


This section presents performance results from 
a range of benchmarks over our DAFS reference im- 
plementation and two kernel-based NFS configura- 
tions. The goal of the analysis is to quantify the 
effects of the various architectural features we have 
discussed and understand how they interact with 
properties of the workload. 


Our system configuration consists of Pentium 
III 800MHz clients and servers with the Server- 
Works LE chipset, equipped with 256MB-1GB of 
SDRAM on a 133 MHz memory bus. Disks are 
9GB 10000 RPM Seagate Cheetahs on a 64-bit /33 
MHz PCI bus. All systems run patched versions 
of FreeBSD 4.3-RELEASE. DAFS uses VI over Gi- 
ganet cLAN 1000 adapters. NFS uses UDP/IP over 
Gigabit Ethernet, with Alteon Tigon-II adapters. 
Table 1: Baseline Network Performance.


             VI/cLAN     UDP/Tigon-II
Latency      30 µs       132 µs
Bandwidth    113 MB/s    120 MB/s


Table 1 shows the raw one-byte roundtrip latency 
and bandwidth characteristics of these networks. 
The Tigon-II has a higher latency partly due to the 
datapath crossing the kernel UDP/IP stack. The 
bandwidths are comparable, but not identical. In 
order to best compare the systems, we present per- 
formance results in Sections 7.1 and 7.2 normalized 
to the maximum bandwidth achievable on the par- 
ticular technology. 
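
The normalization described above can be sketched as follows. This is a minimal illustration, not the analysis code used for the paper; the peak bandwidths come from Table 1, while the dictionary and function names are assumptions introduced here.

```python
# Peak bandwidth of each network in MB/s, taken from Table 1.
LINK_BANDWIDTH_MBPS = {
    "VI/cLAN": 113.0,
    "UDP/Tigon-II": 120.0,
}


def normalized_bandwidth(measured_mbps: float, network: str) -> float:
    """Return measured throughput as a fraction of the link's peak bandwidth."""
    if measured_mbps < 0:
        raise ValueError("measured bandwidth must be non-negative")
    return measured_mbps / LINK_BANDWIDTH_MBPS[network]


if __name__ == "__main__":
    # 60 MB/s over the Tigon-II link is half of that link's peak.
    print(normalized_bandwidth(60.0, "UDP/Tigon-II"))
```

Normalizing this way lets results from the two networks be plotted on a common scale even though their raw bandwidths differ by a few percent.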


NFS clients and servers exchange data in units
of 4KB, 8KB, 16KB or 32KB (the NFS block I/O
transfer size is set at mount time), sent in
fragmented UDP packets over 9000-byte MTU Jumbo
Ethernet frames to minimize data transfer overheads.
Checksum offloading on the Tigon-II is enabled,
minimizing checksum overheads. Interrupt
coalescing on the Tigon-II was set as high as
possible without degrading the minimum one-way
latency of about 66 µs for one-byte messages. NFS
client (nfsiod) and server (nfsd) concurrency was
tuned for best performance in all cases. 
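
To make the fragmentation arithmetic concrete, the sketch below counts the IP fragments needed to carry one NFS block in a single UDP datagram. It assumes standard IPv4 fragmentation (a 20-byte IP header without options, an 8-byte UDP header, non-final fragment payloads in multiples of 8 bytes) and a hypothetical 128-byte RPC/NFS header. It does not model the page-aligned fragmentation change described in Section 6.

```python
import math

IP_HEADER = 20        # bytes, IPv4 header without options
UDP_HEADER = 8        # bytes
RPC_NFS_HEADER = 128  # bytes; hypothetical value chosen for illustration


def fragments_per_block(block_size: int, mtu: int = 9000) -> int:
    """Count IPv4 fragments for one UDP datagram carrying an NFS block."""
    # Non-final fragments carry a payload that is a multiple of 8 bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    datagram = UDP_HEADER + RPC_NFS_HEADER + block_size
    return math.ceil(datagram / per_fragment)


if __name__ == "__main__":
    for kb in (4, 8, 16, 32):
        print(f"{kb}KB block: {fragments_per_block(kb * 1024)} fragment(s)")
```

Under these assumptions a 32KB block needs only a handful of Jumbo-frame fragments, whereas the same block over a 1500-byte MTU would need several times as many packets.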


For NFS experiments with copy avoidance
(NFS-nocopy), we modified the Tigon-II firmware,
IP fragmentation code, file cache code, VM system
and Tigon-II driver for NFS/UDP header splitting
and page remapping as described in Section 6. This
configuration is the state of the art for low-overhead
data transfer over NFS. Experiments with the
standard NFS implementation (NFS) use the standard
Tigon-II driver and vendor firmware. 


7.1 Bandwidth and Overhead 


We first explore the bandwidth and client CPU
overhead for reads with and without read-ahead.
These experiments use a 1GB server cache
pre-warmed with a 768MB dataset. For this
experiment, we factor out client caching as the NFS
client cache is too small to be effective and the
DAFS client does not cache. Thus these experiments
are designed to stress the network data transfer.
These results are representative of workloads
with sequential I/O on large disk arrays or
asynchronous random-access loads on servers with
sufficient disk arms to deliver data to the client at
network speed. None of the architectural features
discussed yield much benefit for workloads and servers
that are disk-bound, as they have no effect on the
remote file system or disk system performance, as
shown in Section 7.4. 


The application’s request size for each file I/O
request is denoted block size in the figures. The ideal
block size depends on the nature of the application;
large blocks are preferable for applications that do
long sequential reads, such as streaming multimedia,
and smaller blocks are useful for nonsequential
applications, such as databases. 


In the read-ahead (sequential) configurations, 
the DAFS client uses the asynchronous I/O API 
(Section 3.3), and NFS has kernel-based sequential
read-ahead enabled. For the experiments without
read-ahead (random access), we tuned NFS for
best-case performance at each request size (block 
size). For request sizes up to 32K, NFS is config- 
ured for a matching NFS transfer size, with read- 
ahead disabled. This avoids unnecessary data trans- 
fer or false prefetching. For larger request sizes we 
used an NFS transfer size of 32K and implicit read- 
ahead up to the block size. One benefit of the user- 
level file system structure is that it allows clients 
to select transfer size and read-ahead policy on a 
per-application basis. All DAFS experiments use a 
transfer size equal to the request block size. 


Random access reads with read-ahead 
disabled. The left graphs in Figure 3 and Figure 4
report bandwidth and CPU utilization, respectively,
for random block reads with read-ahead
disabled. All configurations achieve progressively 
higher bandwidths with increasing block size, since 
the wire is idle between requests. The key observa- 
tion here is that with small request sizes the DAFS 
configuration outperforms both NFS implementa- 
tions by a factor of two to three. This is due to 
lower operation latency, stemming primarily from 
the lower network latency (see Table 1) but also 
from the lower per-I/O overhead. This lower over- 
head results from the transport protocol offload to 
the direct-access NIC, including reduced system-call 
and interrupt overhead possible with user-level net- 
working. 
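
A simple latency model helps explain the small-request gap. The sketch below assumes that each synchronous read costs one network round trip (the one-byte latencies from Table 1) plus the time to move the block at link bandwidth. It ignores server time and per-I/O CPU overhead, so its outputs are rough estimates rather than measurements; the function name is an assumption introduced here.

```python
def sync_read_bandwidth(block_bytes: int, rtt_us: float, link_mbps: float) -> float:
    """Estimate MB/s for back-to-back synchronous reads with no read-ahead.

    With 1 MB = 10**6 bytes, a link of `link_mbps` MB/s moves `link_mbps`
    bytes per microsecond, so bytes / link_mbps is a time in microseconds.
    """
    transfer_us = block_bytes / link_mbps
    # Bytes per microsecond is numerically equal to MB/s.
    return block_bytes / (rtt_us + transfer_us)


if __name__ == "__main__":
    for size in (4096, 32768, 524288):
        dafs = sync_read_bandwidth(size, rtt_us=30.0, link_mbps=113.0)
        nfs = sync_read_bandwidth(size, rtt_us=132.0, link_mbps=120.0)
        print(f"{size:>7} bytes: DAFS ~{dafs:.1f} MB/s, NFS ~{nfs:.1f} MB/s")
```

For 4KB requests this model gives a DAFS-to-NFS ratio of roughly 2.5, in line with the factor of two to three reported above, and the ratio approaches one as the block size grows and transfer time dominates.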

With large request sizes, transfer time dom- 
inates. NFS peaks at less than one-half the link 
bandwidth (about 60 MB/s), limited by memory 
copying overhead that saturates the client CPU. 
DAFS achieves wire speed using RDMA with low 
client CPU utilization, since the CPU is not involved
in the RDMA transfers. NFS-nocopy eliminates the
copy overhead with page flipping; it also
approaches wire speed for large block sizes, but the
overhead for page flipping and kernel protocol code
consumes 50% of the client CPU at peak bandwidth.


Figure 3: Read bandwidth without (left) and with (right) read-ahead.

Figure 4: Read client CPU utilization (%) without (left) and with (right) read-ahead.


Sequential reads with read-ahead. The
right graphs in Figure 3 and Figure 4 report
bandwidths and CPU utilization for a similar experiment
using sequential reads with read-ahead enabled. In
this case, all configurations reach their peak
bandwidth even with small block sizes. Since the
bandwidths are roughly constant, the CPU utilization
figures highlight the differences in the protocols.
Again, when NFS peaks, the client CPU is saturated;
the overhead is relatively insensitive to block
size because it is dominated by copying overhead.


Both NFS-nocopy and DAFS avoid the byte
copy overhead and exhibit lower CPU utilization
than NFS. However, while CPU utilization drops
off with increasing block size, this drop is
significant for DAFS, but noticeably less so for
NFS-nocopy. The NFS-nocopy overhead is dominated by
page flipping and transport protocol costs, both of
which are insensitive to block size changes beyond
the page size and network MTU size. In contrast, 
the DAFS overhead is dominated by request initia- 
tion and response handling costs in the file system 
client code, since the NIC handles data transport 
using RDMA. Therefore, as the number of requests 
drops with the increasing block size, the client CPU 
utilization drops as well. 
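
The contrast between per-request and per-page costs can be illustrated with a toy cost model. The sketch below is not derived from the measurements; the cost constants in the example are hypothetical, and the page size of 4096 bytes is an assumption. It only shows why a client dominated by per-request costs benefits from larger blocks while one dominated by per-page costs does not.

```python
PAGE_SIZE = 4096  # bytes; assumed page size
MB = 1 << 20      # bytes per MB in this sketch


def cpu_us_per_mb(block_bytes: int, per_request_us: float, per_page_us: float) -> float:
    """Client CPU microseconds spent per MB transferred."""
    requests = MB / block_bytes  # requests issued per MB
    pages = MB / PAGE_SIZE       # pages handled per MB, independent of block size
    return requests * per_request_us + pages * per_page_us


if __name__ == "__main__":
    # Hypothetical costs: 20 us per request for both clients, plus
    # 10 us per page (flipping, protocol work) for the kernel client.
    for block in (4096, 16384, 65536):
        rdma_like = cpu_us_per_mb(block, per_request_us=20.0, per_page_us=0.0)
        flip_like = cpu_us_per_mb(block, per_request_us=20.0, per_page_us=10.0)
        print(block, rdma_like, flip_like)
```

In this model the RDMA-like client's cost falls in proportion to the block size, while the page-flipping client's cost levels off at the per-page floor.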

The following experiments illustrate the
importance of these factors for application
performance. 


7.2 TPIE Merge 


This experiment combines raw sequential
I/O performance, including writes, with varying
amounts of application processing. As in the
previous experiment, we configure the system to
stress client data-transfer overheads, which
dominate when the server has adequate CPU and I/O
bandwidth for the application. In this case the
I/O load is spread across two servers using 600MB
memory-based file systems. Performance is limited
by the client CPU rather than the server CPU,
network, or disk I/O.


Figure 5: TPIE Merge throughput for n = 8. 


The benchmark is a TPIE-based sequential
record Merge program, which combines n sorted
input files of x y-byte records, each with a fixed-size
key (an integer), into one sorted output file.
Performance is reported as total throughput:


$$\text{throughput} = \frac{2 \cdot n \cdot x \cdot y \ \text{(bytes)}}{t \ \text{(sec)}}$$





This experiment shows the effect of
low-overhead network I/O on real application
performance, since the merge processing competes with
I/O overhead for client CPU cycles. Varying the
merge order n and/or record size y allows us to
control the amount of CPU work the application
performs per block of data. CPU cost per record
(key comparisons) increases logarithmically with n;
CPU cost per byte decreases linearly with record
size. This is because larger records amortize the
comparison cost across a larger number of bytes,
and there are fewer records per block. 
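
The structure of such a merge can be sketched with a heap of stream heads, which costs O(log n) key comparisons per record. This is not TPIE code: it merges in-memory Python sequences with the standard `heapq` module, and the function names are assumptions introduced here. The throughput helper follows the formula above for n input files of x records of y bytes each; the factor of 2 presumably reflects that every byte is both read and written.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[int, bytes]  # (integer key, payload)


def merge_sorted_streams(streams: Iterable[Iterable[Record]]) -> Iterator[Record]:
    """Merge n key-sorted record streams into one key-sorted stream."""
    heap: List[tuple] = []
    for idx, stream in enumerate(streams):
        it = iter(stream)
        first = next(it, None)
        if first is not None:
            # The stream index breaks key ties, so payloads are never compared.
            heapq.heappush(heap, (first[0], idx, first[1], it))
    while heap:
        key, idx, payload, it = heapq.heappop(heap)
        yield key, payload
        nxt = next(it, None)
        if nxt is not None:
            heapq.heappush(heap, (nxt[0], idx, nxt[1], it))


def merge_throughput(n: int, x: int, y: int, seconds: float) -> float:
    """Total throughput in bytes/sec for n files of x records of y bytes."""
    return 2 * n * x * y / seconds


if __name__ == "__main__":
    a = [(1, b"a"), (4, b"d")]
    b = [(2, b"b"), (3, b"c")]
    print(list(merge_sorted_streams([a, b])))
```

Each output record triggers one pop and at most one push on a heap of at most n entries, which is where the logarithmic per-record cost comes from.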


We ran the Merge program over two variants
of the TPIE library, as described in Section 5.
One variant (TPIE/DAFS) is linked with the
user-level DAFS client and accesses the servers using
DAFS. In this variant, TPIE manages the streaming
using asynchronous I/O, with zero-copy reads
using RDMA and zero-copy writes using inline
DAFS writes over the cLAN’s scatter/gather
messaging (cLAN does not support DAFS writes using
RDMA). The second variant (TPIE/NFS) is
configured to use the standard kernel file system interface
to access the servers over NFS. For TPIE/NFS, we
ran experiments using both standard NFS and
NFS-nocopy configurations, as in the previous section.
For the NFS configurations, we tuned read-ahead
and write-behind concurrency (nfsiods) for the best
performance in all cases. The TPIE I/O request
size was fixed to 32KB. 


Figure 5 shows normalized merge throughput
results; these are averages over ten runs with a
variance of less than 2% of the average. The merge
order is n=8. The record size varies on the x-axis,
showing the effect of changing the CPU demands
of the application. The results show that the lower
overhead of the DAFS client (noted in the previous
section) leaves more CPU and memory cycles
free for the application at a given I/O rate,
resulting in higher merge throughputs. Note that the
presence of application processing accentuates the
gap relative to the raw-bandwidth tests in the
previous subsection. It is easy to see that when the
client CPU is saturated, the merge throughput is
inversely proportional to the total processing time
per block, i.e., the sum of the total per-block I/O
overhead and application processing time. 
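
The relationship can be written compactly. The symbols below (T for throughput, B for block size, o for per-block I/O overhead, a for per-block application processing time) are notation introduced here for illustration and are not used elsewhere in the paper.

```latex
T = \frac{B}{o + a},
\qquad
\frac{T_{\mathrm{DAFS}}}{T_{\mathrm{NFS}}}
  = \frac{o_{\mathrm{NFS}} + a}{o_{\mathrm{DAFS}} + a}
```

Under this simplification, the ratio tends to 1 as a grows (a compute-bound application) and to o_NFS / o_DAFS as a shrinks, so the benefit of lower I/O overhead is largest when application processing per block is small.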


For example, on the right side of the graph, 
where the application has the highest I/O demand 
(due to larger records) and hence the highest I/O 
overhead, DAFS outperforms NFS-nocopy by as 
much as 40%, as the NFS-nocopy client consumes 
up to 60% of its CPU in the kernel executing pro- 
tocol code and managing I/O. Performance of the 
NFS configuration is further limited by memory 
copying through the system-call interface in writes. 


An important point from Figure 5 is that 
the relative benefit of the low-overhead DAFS con- 
figuration is insignificant when the application is 
compute-bound. As application processing time per 
block diminishes (from left to right in Figure 5), re- 
ducing I/O overhead yields progressively higher re- 
turns because the I/O overhead is a progressively 
larger share of total CPU time. 


7.3 PostMark 


PostMark [22] is a synthetic benchmark aimed 
at measuring file system performance over a work- 
load composed of many short-lived, relatively small 
files. Such a workload is typical of mail and net- 
news servers used by Internet Service Providers. 
PostMark workloads are characterized by a mix of 
metadata-intensive operations. The benchmark be- 
gins by creating a pool of files with random sizes 
within a specified range. The number of files, as 
well as upper and lower bounds for file sizes, are 
configurable. After creating the files, a sequence of 
transactions is performed. These transactions are 
chosen randomly from a file creation or deletion
operation paired with a file read or write. A file
creation operation creates and writes random text to a
file. File deletion removes a random file from the
active set. File read reads a random file in its entirety
and file write appends a random amount of data to 
a randomly chosen file. In this section, we consider 
DAFS as a high-performance alternative to NFS for 
deployment in mail and netnews servers [9, 10]. 
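
The transaction mix described above can be sketched as a small local workload generator. This is a simplified illustration, not the PostMark benchmark itself: it runs against a temporary local directory, and the equal probabilities, the append sizes, and all function and parameter names are assumptions introduced here.

```python
import os
import random
import tempfile


def _write_new_file(root: str, name: str, size: int) -> str:
    """Create a file of `size` random bytes and return its path."""
    path = os.path.join(root, name)
    with open(path, "wb") as fh:
        fh.write(os.urandom(size))
    return path


def run_postmark_like(num_files=20, transactions=100,
                      min_size=512, max_size=4096, seed=0):
    """Run a PostMark-style workload and return per-operation counts."""
    rng = random.Random(seed)
    counts = {"create": 0, "delete": 0, "read": 0, "append": 0}
    with tempfile.TemporaryDirectory() as root:
        # Phase 1: create the initial pool of files with random sizes.
        active = [
            _write_new_file(root, f"pool{i}", rng.randint(min_size, max_size))
            for i in range(num_files)
        ]
        # Phase 2: each transaction pairs a create/delete with a read/append.
        for t in range(transactions):
            if rng.random() < 0.5 or not active:
                size = rng.randint(min_size, max_size)
                active.append(_write_new_file(root, f"txn{t}", size))
                counts["create"] += 1
            else:
                victim = active.pop(rng.randrange(len(active)))
                os.remove(victim)
                counts["delete"] += 1
            if not active:
                continue  # nothing left to read or append to
            target = rng.choice(active)
            if rng.random() < 0.5:
                with open(target, "rb") as fh:
                    fh.read()  # read the file in its entirety
                counts["read"] += 1
            else:
                with open(target, "ab") as fh:
                    fh.write(os.urandom(rng.randint(min_size, max_size)))
                counts["append"] += 1
    return counts


if __name__ == "__main__":
    print(run_postmark_like())
```

Because every transaction includes a create or delete, a workload of this shape is dominated by metadata operations when files are small.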


We compare PostMark performance over 
DAFS to NFS-nocopy. We tune NFS-nocopy to ex- 
hibit best-case performance for the average file size 
used each time. For file sizes less than or equal to 
32K, NFS uses a block size equal to the file size (to 
avoid read-ahead). For larger file sizes, NFS uses 
32K blocks, and read-ahead is adjusted according to 
the file size. Read buffers are page-aligned to enable 
page remapping in delivering data to the user pro- 
cess. In all cases an FFS file system is exported us- 
ing soft updates [17] to eliminate synchronous disk 
I/O for metadata updates. 


Our NFS-nocopy implementation is based on 
NFS Version 3 [30]. It uses write-behind for file 
data: writes are delayed or asynchronous depend- 
ing on the size of the data written. On close, the 
NFS client flushes dirty buffers to server memory 
and waits for flushing to complete but does not 
commit them to server disk. NFS Version 3 open- 
to-close consistency dictates that cached file blocks 
be re-validated with the server each time the file is 
opened (see footnote 1). 

With DAFS, the client does not cache and all 
I/O requests go to the server. Writes are not syn- 
chronously committed to disk, offering a data relia- 
bility similar to that provided by NFS. DAFS inlines 
data with the write RPC request (since the cLAN 
cannot support server-initiated RDMA read opera- 
tions required for direct file writes) but uses direct 
I/O for file reads. In both cases, the client waits for 
the RPC response. 


A key factor that determines PostMark
performance is the cost of metadata operations (e.g.,
open, close, create, delete). Enabling soft updates
on the server-exported filesystem decouples
metadata updates from the server disk I/O system. This
makes the client metadata operation cost sensitive 
to the client-server RPC as well as to the metadata 
update on the server filesystem. As the number 
of files per directory increases, the update time on 
the server dominates because FFS performs a linear 
time lookup. Other factors affecting performance 
are client caching and network I/O. 


Footnote 1: In the NFS implementation we used, writing to a file
before closing it originally resulted in an invalidation of cached
blocks in the next open of that file, as the client could not
tell who was responsible for the last write. We modified our
implementation to avoid this problem.


Figure 6: PostMark. Effect of I/O boundedness.


In order to better understand the PostMark
performance, we measured the latency of the
micro-operations involved in each PostMark transaction.
As we saw in Table 1, the Tigon-II has significantly 
higher roundtrip latency than the cLAN. As a re- 
sult, there is a similar difference in the null-RPC 
cost which makes the cLAN RPC three times faster 
than Tigon-II RPC (47 µs vs. 154 µs). In addition, 
all DAFS file operations translate into a single RPC 
to the server. With NFS, a file create or remove 
RPC is preceded by a second RPC to get the at- 
tributes of the directory that contains the file. Sim- 
ilarly, a file open requires an RPC to validate the 
file’s cached blocks. The combination of more ex- 
pensive and more frequent RPCs introduces a sig- 
nificant performance differential between the two 
systems. 
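
The combined effect of RPC cost and RPC count can be estimated with a short calculation. The sketch below uses the null-RPC costs and the RPC counts for create and remove stated above; it is a lower bound on network cost only, ignoring server processing, and the names used are assumptions introduced here.

```python
# Null-RPC round-trip costs in microseconds, as reported in the text.
NULL_RPC_US = {"DAFS/cLAN": 47.0, "NFS/Tigon-II": 154.0}

# RPCs per metadata operation: DAFS issues one RPC per operation, while
# NFS precedes create and remove with a directory attribute fetch.
RPCS_PER_OP = {
    "DAFS/cLAN": {"create": 1, "remove": 1},
    "NFS/Tigon-II": {"create": 2, "remove": 2},
}


def min_network_cost_us(system: str, op: str) -> float:
    """Lower bound on network time for one metadata operation."""
    return RPCS_PER_OP[system][op] * NULL_RPC_US[system]


if __name__ == "__main__":
    for system in NULL_RPC_US:
        print(system, min_network_cost_us(system, "create"), "us per create")
```

By this estimate a file create costs at least 308 µs of network time over NFS against 47 µs over DAFS, which is why metadata-heavy workloads with small files show the largest gap.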


This experiment measures the effect of increas- 
ing I/O boundedness on performance. We increase 
the average file size from 4KB to 64KB, maintain- 
ing a small number of files (about 150) to mini- 
mize the server lookup time. Each run performs 
30,000 transactions, each of which is a create or 
delete paired with a read operation. We report the 
average of ten runs, which have a variance under 
10% of the average. The client and server have 
256MB and 1GB of memory, respectively. At small 
file sizes, the increased cost of metadata operations 
for NFS-nocopy dominates its performance. As the 
I/O boundedness of the workload increases, DAFS 
performance becomes dominated by network I/O 
transfers and drops linearly. For large file sizes, 
reads under NFS-nocopy benefit from caching, but 
writes still have to go to the server to retain open- 
to-close semantics. 


This experiment shows that DAFS outper- 
forms NFS-nocopy by more than a factor of two 
for small files due largely to the lower latency of 
metadata operations on the cLAN network. Part of 
this difference could be alleviated by improvements 
in networking technology. However, the difference
in the number of RPCs between DAFS and NFS is
fundamental to the protocols. For larger files, the
NFS-nocopy benefit from client caching is limited
by its consistency model that requires waiting on
outstanding asynchronous writes on file close.


General Track: 2002 USENIX Annual Technical Conference


Figure 7: Berkeley DB. Effect of double caching
and remote memory access performance.
(Plot of transactions/sec against dataset size in MB on a log scale,
for DAFS and NFS-nocopy.)


7.4 Berkeley DB 


In this experiment, we use a synthetic work- 
load composed of read-only transactions, each ac- 
cessing one small record uniformly at random from 
a B-tree to compare db performance over DAFS to 
NFS-nocopy. The workload is single-threaded and 
read-only, so there is no logging or locking. In all 
experiments, after warming the db cache we
performed a sequence of read transactions long enough
to ensure that each record in the database is touched 
twice on average. The results report throughput in 
transactions per second. The unit of I/O is a 16KB 
block. 


We vary the size of the db working set in
order to change the bottleneck from local memory,
to remote memory, to remote disk I/O. We
compare a DAFS db client to an NFS-nocopy db client
both running on a machine with 256MB of physical
memory. In both cases the server runs on a machine
with 1GB of memory. Since we did not expect
read-ahead to help in the random access pattern
considered here, we disable read-ahead for NFS-nocopy
and use a transfer size of 16K. The db user-level
cache size is set to the amount of physical memory
expected to be available for allocation by the user
process. The DAFS client uses about 36MB for
communication buffers and statically-sized
structures leaving about 190MB to the db cache. To
facilitate comparison between the systems, we
configure the cache identically for NFS-nocopy. 


Figure 7 reports throughput with warm db 
and server cache. In NFS-nocopy, reading through 
the file system cache creates competition for
physical memory between the user-level and file system
caches (Section 5). For database sizes up to the 
size of the db cache, the user-level cache is able to 
progressively use more physical memory during the 
warming period, as network I/O diminishes. Perfor- 
mance is determined by local memory access as db 
eventually satisfies requests entirely from the local 
cache. 


Once the database size exceeds the client
cache, performance degrades as both systems start
accessing remote memory. NFS-nocopy
performance degrades more sharply due to two effects.
First, the double caching effect creating
competition for physical memory between the user-level and
file system caches is now persistent due to increased
network I/O demands. As a result, the filesystem
cache grows and user-level cache memory is paged
out to disk, causing future page faults. Second, since
the db API issues unaligned page reads from the
file system, NFS-nocopy cannot use page
remapping to deliver data to the user process. The DAFS
client avoids these effects by maintaining a single
client cache and doing direct block reads into db
buffers. For database sizes larger than 1GB that
cannot fit in the server cache, both systems are disk
I/O bound on the server. 
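
The three regimes described in this section can be summarized in a small classifier. This is a sketch under stated assumptions: the thresholds are the approximate 190MB db cache and the 1GB server memory mentioned above, treated as sharp cutoffs, and the function name is introduced here for illustration.

```python
def expected_bottleneck(db_size_mb: float,
                        client_cache_mb: float = 190.0,
                        server_cache_mb: float = 1024.0) -> str:
    """Return the resource expected to limit read throughput."""
    if db_size_mb <= client_cache_mb:
        return "local memory"   # requests served from the db cache
    if db_size_mb <= server_cache_mb:
        return "remote memory"  # misses served from the server cache
    return "remote disk"        # working set exceeds the server cache


if __name__ == "__main__":
    for size in (128, 256, 512, 2048):
        print(size, "MB:", expected_bottleneck(size))
```

In practice the transitions are gradual rather than sharp, since part of each machine's memory is used by the operating system and the hit rate falls off smoothly as the working set grows.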


8 Conclusions 


This paper explores the key architectural fea- 
tures of the Direct Access File System (DAFS), 
a new architecture and protocol for network- 
attached storage over direct-access transport net- 
works. DAFS or other approaches that exploit such 
networks and user-level file systems have the
potential to close the performance gap between
full-featured network file services and network storage
based on block access models. 


The contribution of our work is to characterize 
the issues that determine the effects of DAFS on ap- 
plication performance, and to quantify these effects 
using experimental results from a public DAFS ref- 
erence implementation for an open-source Unix sys- 
tem (FreeBSD). For comparison, we report results 
from an experimental zero-copy NFS implementa- 
tion. This allows us to evaluate the sources of the 
performance improvements from DAFS (e.g., copy 
overhead vs. protocol overhead) and the alterna- 
tives for achieving those benefits without DAFS. 

DAFS offers significant overhead reductions for
high-speed data transfer. These improvements
result primarily from direct access and RDMA, the
cornerstones of the DAFS design, and secondarily
from the effect of transport offload to the network
adapter. These benefits can yield significant
application improvements. In particular, DAFS
delivers the strongest benefit for balanced workloads
in which application processing saturates the CPU
when I/O occurs at network speed. In such a
scenario, DAFS improves application performance by
up to 40% over NFS-nocopy for TPIE Merge.
However, many workload factors can undermine these
benefits. Direct-access transfer yields little benefit
with servers that are disk-limited, or with workloads
that are heavily compute-bound. 


Our results also show that I/O adaptation li- 
braries can obtain benefits from the DAFS user- 
level client architecture, without the need to port 
applications to a new (DAFS) API. Most
importantly, adaptation libraries can leverage the
additional control over concurrency (asynchrony), data
movement, buffering, prefetching and caching in
application-specific ways, without burdening appli- 
cations. This creates an opportunity to address 
longstanding problems related to the integration of 
the application and file system for high-performance 
applications. 


10 Software Availability


The DAFS and NFS-nocopy implementations
used in this paper are available from
http://www.eecs.harvard.edu/vino/fs-perf/dafs and
http://www.cs.duke.edu/ari/dafs.


Abstract

The rapidly declining cost of persistent RAM
technologies prompts the question of when, not
whether, such memory will become the preferred
storage medium for many computers. Conquest is a file
system that provides a transition from disk to persistent
RAM as the primary storage medium. Conquest
provides two specialized and simplified data paths to
in-core and on-disk storage, and Conquest realizes
most of the benefits of persistent RAM at a fractional 
cost of a RAM-only solution. As of October 2001, 
Conquest can be used effectively for a hardware cost of 
under $200. 

We compare Conquest’s performance to ext2, 
reiserfs, SGI XFS, and ramfs, using popular 
benchmarks. Our measurements show that Conquest 
incurs little overhead compared to ramfs. Compared to
the disk-based file systems, Conquest achieves 24% to
1900% faster memory performance, and 43% to 96%
faster performance when exercising both memory and
disk.


1 Introduction 


For over 25 years, long-term storage has been 
dominated by rotating magnetic media. At the 
beginning of the disk era, tapes were still widely used 
for online storage; today, they are almost exclusively 
used for backup despite still being cheaper than disks. 
The reasons are both price threshold and performance: 
although disks are more expensive, they are cheap 
enough for common use, and their performance is 
vastly superior. 

Today, the rapidly dropping price of RAM 
suggests that a similar transition may soon take place, 
with all-electronic technologies gradually replacing 
disk storage. This transition is already happening in 
portable devices such as cameras, PDAs, and MP3 
players. Because rotational delays are not relevant to 
persistent RAM storage, it is appropriate to consider 
whether existing file system designs are suitable in this 
new environment. 

The Conquest file system is designed to address 
these questions and to smooth the transition from disk- 
based to persistent-RAM-based storage. Unlike other 
memory file systems [21, 10, 43], Conquest provides an 
incremental solution that assumes more file system 
responsibility in-core as memory prices decline. Unlike 
HeRMES [25], which deploys a relatively modest 
amount of persistent RAM to alleviate disk traffic, 
Conquest assumes an abundance of RAM to perform 
most file system functions. In essence, Conquest 
provides two specialized and simplified data paths to 
in-core and on-disk storage. Conquest achieves most of 
the benefits of persistent RAM without the full cost of 
RAM-only solutions. As persistent RAM becomes 
cheaply abundant, Conquest can realize more additional 
benefits incrementally. 


2 Alternatives to Conquest 


Given the promise of using increasingly cheap memory 
to improve file systems performance, it would be 
desirable to do so as simply as possible. However, the 
obvious simple methods for gaining such benefits fail to 
take complete advantage of the new possibilities. In 
many cases, extensions to the simple methods can give 
results similar to our approach, but to make these 
extensions, so much complexity must be added that 
they are no longer attractive alternatives to the 
Conquest approach. 

In this section, we will discuss the limitations of 
these alternatives. Some do not provide the expected 
performance gains, while others do not provide a 
complete solution to the problem of storing arbitrary 
amounts of data persistently, reliably, and conveniently. 
Rather than adding the complications necessary to fix 
these approaches, it is better to start the design with a 
clean slate. 

2.1 Caching 

One alternative to a hybrid RAM-based file system like 
Conquest is instead to take advantage of the existing 
file buffer cache. Given that a computer has an ample 
amount of RAM, why not just allocate that RAM to a 
buffer cache, rather than dedicating it to a file storage 
system? This approach seems especially appropriate 
because the buffer cache tends to populate itself with 
the most frequently referenced files, rather than wasting 
space on files that have been untouched for lengthy 
periods. 

However, using the buffer cache has several 
drawbacks. Roselli et al. [34] showed that caching 
often experiences diminishing marginal returns as the 
size of cache grows larger. They also found that caches 
could experience miss rates as high as 10% for some 
workloads, which is enough to reduce performance 
significantly. 
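
A short calculation shows why a 10% miss rate matters so much. The sketch below computes the average access time as a weighted mix of hit and miss latencies. The latencies used in the example (1 µs for a memory hit, 10 ms for a disk miss) are hypothetical round numbers chosen for illustration, not figures from the cited study.

```python
def effective_access_us(miss_rate: float, hit_us: float, miss_us: float) -> float:
    """Average access time given a cache miss rate and per-case latencies."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return (1.0 - miss_rate) * hit_us + miss_rate * miss_us


if __name__ == "__main__":
    # Hypothetical latencies: 1 us per hit, 10,000 us (10 ms) per miss.
    for rate in (0.0, 0.01, 0.10):
        print(f"miss rate {rate:.2f}: {effective_access_us(rate, 1.0, 10_000.0):.1f} us")
```

With these assumed latencies, a 10% miss rate makes the average access roughly a thousand times slower than an all-hit workload, because the disk term dominates the sum.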

Another challenge is handling cache pollution, 
which can have a variety of causes—reading large files, 
buffering asynchronous writes, daily backups, global 
searches, disk maintenance utilities, etc. This problem 
led to remedies such as LFU buffer replacement for 
large files or attempts to reduce cache-miss latency by 
modifying compilers [39], placing the burden on the 
programmer [31], or constructing user behavior- 
analysis mechanisms within the kernel [15, 19]. 

Caches also make it difficult to maintain data 
consistency between memory and disk. A classic 
example is metadata commits, which are synchronous 
in most file systems. Asynchronous solutions do exist, 
but at the cost of code complexity [12, 38]. 

Moving data between disk and memory can 
involve remarkably complex management. For 
example, moving file data from disk to memory 
involves locating the metadata, scheduling the metadata 
transfer to memory, translating the metadata into 
runtime form, locating data and perhaps additional 
metadata, scheduling the data transfer, and reading the 
next data block ahead of time. 

Conquest fundamentally differs from caching by 
not treating memory as a scarce resource. Instead, 
Conquest anticipates the abundance of cheap persistent 
RAM and uses disk to store the data well suited to disk 
characteristics. We can then achieve simpler disk 
optimizations by narrowing the range of access patterns 
and characteristics anticipated by the file system. 

2.2 RAM Drives and RAM File Systems 

Many computer scientists are so used to disk storage 
that we sometimes forget that persistence is not 
automatic. In addition to the storage medium, 
persistence also requires a protocol for storing and 
retrieving the information from the persistent medium, 
so that a file system can survive reboots. While 
persistent RAM provides nonvolatility of memory 
content, the file system and the memory manager also 
need to know how to take advantage of the storage 
medium. 

Most RAM disk drivers operate by emulating a 
physical disk drive. Although there is a file system 
protocol for storing and retrieving the in-memory 
information, there is no protocol to recover the 
associated memory states. Given that the existing 
memory manager is not aware of RAM drives, isolating 
these memory states for persistence can be nontrivial. 

RAM file systems under Linux and BSD [21] use 
the IO caching infrastructure provided by VFS to store 
both metadata and data in various temporary caches 
directly. Since the memory manager is unaware of 
RAM file systems, neither the file system nor the 
memory states survive reboots without significant 
modifications to the existing memory manager. 

Both RAM drives and RAM file systems also incur 
unnecessary disk-related overhead. For RAM drives, 
existing file systems, tuned for disk, are installed on the 
emulated drive without regard for the absence of the 
mechanical limitations of disks. For example, access to 
RAM drives is done in blocks, and the file system will 
still waste effort attempting to place files in "cylinder 
groups" even though cylinders and block boundaries no 
longer exist. Although RAM file systems have 
eliminated some disk-related complexities, many RAM 
file systems rely on VFS and its generic storage access 
routines; many built-in mechanisms such as readahead 
and buffer-cache reflect assumptions that the 
underlying storage medium is slower than memory. 

In addition, both RAM drives and RAM file 
systems limit the size of the files they can store to the 
size of main memory. These restrictions have limited 
the use of RAM disks to caching and temporary file 
systems. To move to a general-purpose persistent- 
RAM file system, we need a substantially new design. 

2.3 Disk Emulators 

Some manufacturers advocate RAM-based disk 
emulators for specialty applications [44]. These 
emulators generally plug into a standard SCSI or 
similar IO port, and look exactly like a disk drive to the 
CPU. Although they provide a convenient solution to 
those who need an instant speedup, and they do not 
suffer the persistence problem of RAM disks, they 
again are an interim solution that does not address the 
underlying problem and does not take advantage of the 
unique benefits of RAM. In addition, standard IO 
interfaces force the emulators to operate through 
inadequate access methods and low-bandwidth cables, 
greatly limiting the utility of this option [33] as 
something other than a stopgap measure.

2.4 Ad Hoc Approaches
There are also a number of less structured approaches
to using existing tools to exploit the abundance of 
RAM. For example, one could achieve persistence by 
manually transferring files into ramfs at boot time and 
preserving them again before shutdown. However, this 
method would drastically limit the total file system size. 
Another option is to attempt to manage RAM space 
by using a background daemon to stage files to a disk 
partition. Although this could be made to work, it 
would require significant additional complexity to 
maintain the single name space provided by Conquest 
and to preserve the semantics of symbolic and hard 
links when moving files between storage media. 


3 Conquest File System Design 


Our initial design assumes the popular single-user 
desktop environment with 1 to 4 GB of persistent
RAM, which is affordable today. As of October 2001,
we can add 2 GB of battery-backed RAM to our 
desktop computers and deploy Conquest for under $200 
[32]. Extending our design to other environments will 
be future work. 

We will first present the design of Conquest, 
followed by a discussion of major design decisions. 


3.1 File System Design 


In our current design, Conquest stores all small files,
metadata, executables, and shared libraries in persistent
RAM; disks hold only the data content of remaining 
large files. We will discuss this media usage strategy 
further in Section 3.2. 

An in-core file is stored logically contiguously in 
persistent RAM. Disks store only the data content of 
large files with coarse granularity, thereby reducing 
management overhead. For each large file, Conquest 
maintains a segment table in persistent RAM. On-disk 
allocation is done contiguously whenever possible in 
temporal order, similar to LFS [35] but without the 
need to perform continuous disk cleaning in the 
background. 

For each directory, Conquest maintains a variant of
an extensible hash table for its file metadata entries,
with file names as keys. Hard links are supported by
allowing multiple names (potentially under different
directories) to hash to the same file metadata entry.

RAM storage allocation uses existing mechanisms 
in the memory manager when possible to avoid 
duplicate functionality. For example, the storage 
manager is relieved of maintaining a metadata
allocation table and a free list by using the memory
address of the file metadata as its unique ID.

Although it reuses the code of the existing memory 
manager, Conquest has its own dedicated instances of 
the manager, residing persistently inside Conquest, each 
governing its own memory region. Paging and
swapping are disabled for Conquest memory, but
enabled for the non-Conquest memory region. 

Unlike caching, RAM drives, and RAM file 
systems, Conquest memory is the final storage 
destination for many files and all metadata. We can 
access the critical path of Conquest’s main store 
without disk-related complexity in data duplication, 
migration, translation, synchronization, and associated 
management. Unlike RAM drives and RAM file 
systems, Conquest provides persistence and storage
capacity beyond the size limitation of the physical main
store. 


3.2 Media-Usage Strategy 


The first major design decision of Conquest is the 
choice of which data to place on disk, and the answer 
depends on the characteristics of popular workloads. 
Recent studies [9, 34, 42] independently confirm the 
often-repeated observations [30]:


• Most files are small.

• Most accesses are to small files.

• Most storage is consumed by large files, which
are, most of the time, accessed sequentially.


Although one could imagine many complex data- 
placement algorithms (including LRU-style migration 
of unused files to the disk), we have taken advantage of 
the above characteristics by using a simple threshold to 
choose which files are candidates for disk storage. 
Only the data content of files above the threshold
(currently 1 MB) is stored on disk. Smaller files, as
well as metadata, executables, and libraries, are stored 
in RAM. The current choice of threshold works well, 
leaving 99% of all files in RAM in our tests. By 
enlarging this threshold, Conquest can incrementally 
use more RAM storage as the price of RAM declines. 
The current threshold was chosen somewhat arbitrarily, 
and future research will examine its appropriateness. 
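
The threshold rule described above can be illustrated with a minimal sketch. The function and constant names are hypothetical and are not taken from the Conquest source; the sketch also assumes that "1 MB" means 2**20 bytes and that only files strictly above the threshold go to disk.

```python
# Illustrative sketch only; names are hypothetical, not Conquest code.
THRESHOLD_BYTES = 1 << 20  # assumption: "1 MB" means 2**20 bytes


def data_placement(file_size, threshold=THRESHOLD_BYTES):
    """Return where a file's data content would be stored."""
    # Only files strictly above the threshold send their data to disk;
    # metadata stays in RAM in either case.
    return "disk" if file_size > threshold else "ram"


def fraction_in_ram(file_sizes, threshold=THRESHOLD_BYTES):
    """Fraction of files whose data content would stay in RAM."""
    in_ram = sum(
        1 for size in file_sizes if data_placement(size, threshold) == "ram"
    )
    return in_ram / len(file_sizes)
```

Passing a larger `threshold` to `fraction_in_ram` over a measured file-size distribution is one way to explore how much more data would move into RAM as the threshold grows.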

The decision to use a threshold simplifies the code, 
yet does not waste an unreasonable amount of memory 
since small files do not consume a large amount of total 
space. An additional advantage of the size-based 
threshold is that all on-disk files are large, which allows 
us to achieve significant simplifications in disk layout. 
For example, we can avoid adding complexity to handle 
fragmentation with "large" and "small" disk blocks, as 
in FFS [20]. Since we assume cheap and abundant 
RAM, the advantages of using a threshold far outweigh
the small amount of space lost by storing rarely used
files in RAM.

3.2.1 Files Stored in Persistent RAM

Small files and metadata benefit the most from being 
stored in persistent RAM, given that they are more 
susceptible to disk-related overheads. Since persistent 
RAM access granularity is byte-oriented rather than 
block-oriented, a single-byte access can be six orders of 
magnitude faster than accessing disk [23]. 

Metadata no longer have dual representations, one 
in memory and one on disk. The removal of the disk 
representation also removes the complex synchronous 
or asynchronous mechanisms needed to propagate the 
metadata changes to disk [20, 12, 38], and avoids 
translation between the memory and disk
representations.

At this time, Conquest does not give special 
treatment to executables and shared libraries by forcing 
them into memory, but we anticipate benefits from 
doing so. In-place execution will reduce startup costs 
and the time involved in faulting pages into memory 
during execution. Since shared libraries are modular 
extensions of executables, we intend to store them in- 
core as well.¹

3.2.2 Large-File-Only Disk Storage

Historically, the handling of small files has been one 
major source of file system design complexity. Since 
small files are accessed frequently, and a small transfer 
size makes mechanical overheads significant, designers 
employ various techniques to speed up small-file 
accesses. For example, the content of small files can be 
stored in the metadata directly, or a directory structure 
can be mapped into a balanced tree on disk to ensure a 
minimum number of indirections before locating a 
small file [26]. Methods to reduce the seek time and 
rotational latency [20] are other attempts to speed up 
small-file accesses. 

Small files introduce significant storage overhead 
because optimal disk-access granularities tend to be 
large and fixed, causing excessive internal 
fragmentation. Although reducing the access 
granularity necessitates higher overhead and lower disk 
bandwidth, the common remedy is nevertheless to 
introduce sub-granularities and extra management code 
to handle small files. 

Large-file-only disk storage can avoid all these
small-file-related complexities, and management
overhead can be reduced with coarser access
granularity. Sequential-access-mostly large files
exhibit well-defined read-ahead semantics. Large files
are also read-mostly and incur little synchronization-related
overhead. Combined with large data transfers
and the lack of disk arm movements, disks can deliver
near raw bandwidth when accessing such files.

¹ Shared libraries can be trivially identified through magic numbers
and existing naming and placement conventions.


3.3 Metadata Representation 

How file system metadata is handled is critical, since
this information is in the path of all file accesses. 
Below, we outline how Conquest optimizes behavior by 
its choices of metadata representation. 

3.3.1 In-Core File Metadata 

One major simplification of our metadata representation 
is the removal of nested indirect blocks from the 
commonly used i-node design. Conquest stores small 
files, metadata, executables, and shared libraries in 
persistent RAM, via uniform, single-level, dynamically 
allocated index blocks, so in-core data blocks are 
virtually contiguous. 

Conquest does not use the v-node data structure 
provided by VFS to store metadata, because the v-node 
is designed to accommodate different file systems with 
a wide variety of attributes. Also, Conquest does not 
need many mechanisms involved in manipulating v- 
nodes, such as metadata caching. Conquest's file 
metadata consists of only the fields (53 bytes) needed to 
conform to POSIX specifications. 

To avoid file metadata management, we use the 
memory addresses of the Conquest file metadata as 
unique IDs. By leveraging the existing memory 
management code, this approach ensures unique file 
metadata IDs, no duplicate allocation, and fast retrieval 
of the file metadata. The downside of this decision is 
that we may need to modify the memory manager to 
anticipate that certain allocations will be relatively 
permanent. 

For small in-core write requests where the total 
allocation is unknown in advance, Conquest allocates 
data blocks incrementally. The current implementation 
does not return unused memory in the last block of a
file, though we plan to add automatic truncation as a 
future optimization. Conquest also supports “holes” 
within a file, since they are commonly seen during 
compilation and other activities. 

3.3.2 Directory Metadata

We used a variant of extensible hashing [11] for our 
directory representation. The directory structure is built 
with a hierarchy of hash tables, using file names as 
keys. Collisions are resolved by splitting (or doubling) 
hash indices and unmasking an additional hash bit for 
each key. A path is resolved by recursively hashing
each name component of the path at each level of the
hash table.

Compared to ext2’s approach, hashing removes the 
need to compact directories that live in multiple 
(possibly indirect) blocks. Also, the use of hashing 
easily supports hard links by allowing multiple names 
to hash to the same file metadata entry. In addition, 
extensible hashing preserves the ordering of hashed
items when changing the table size, and this property
allows readdir() to walk through a directory
correctly while a hash table is being resized (e.g., during
recursive deletions).

One concern with using extensible hashing is the
wasted indices due to collisions and subsequent
splitting of hash indices. However, we found that
alternative compact hashing schemes would consume
a similar amount of space to preserve ordering during a
resize operation.
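
The directory scheme can be illustrated with a textbook form of extensible hashing. This is a sketch under stated assumptions, not Conquest's variant: the class names, the bucket capacity, and the use of CRC32 as the hash function are all hypothetical choices made for illustration. It shows index doubling, splitting on an additional hash bit, hard links as several names bound to one metadata object, and per-component path resolution.

```python
import zlib


class Bucket:
    def __init__(self, depth):
        self.depth = depth  # number of low-order hash bits this bucket uses
        self.items = {}     # file name -> metadata object


class Directory:
    """Textbook extensible hash table keyed by file name (illustrative)."""

    MAX_DEPTH = 32  # CRC32 yields 32 bits; stop splitting beyond that

    def __init__(self, bucket_capacity=4):
        self.capacity = bucket_capacity
        self.global_depth = 0
        self.slots = [Bucket(0)]  # always 2**global_depth entries

    @staticmethod
    def _hash(name):
        return zlib.crc32(name.encode("utf-8"))

    def _bucket(self, name):
        mask = (1 << self.global_depth) - 1
        return self.slots[self._hash(name) & mask]

    def lookup(self, name):
        return self._bucket(name).items.get(name)

    def link(self, name, meta):
        """Bind a name to a metadata object; several names may share one."""
        while True:
            bucket = self._bucket(name)
            has_room = len(bucket.items) < self.capacity
            if name in bucket.items or has_room or bucket.depth >= self.MAX_DEPTH:
                bucket.items[name] = meta
                return
            self._split(bucket)

    def unlink(self, name):
        return self._bucket(name).items.pop(name, None)

    def _split(self, bucket):
        if bucket.depth == self.global_depth:
            self.slots = self.slots + self.slots  # double the index
            self.global_depth += 1
        bit = 1 << bucket.depth  # the newly unmasked hash bit
        bucket.depth += 1
        sibling = Bucket(bucket.depth)
        for name in list(bucket.items):
            if self._hash(name) & bit:
                sibling.items[name] = bucket.items.pop(name)
        for i, slot in enumerate(self.slots):
            if slot is bucket and (i & bit):
                self.slots[i] = sibling


def resolve(root, path):
    """Resolve a path by hashing one name component per directory level."""
    node = root
    for part in path.strip("/").split("/"):
        if not part:
            continue
        if not isinstance(node, Directory):
            return None
        node = node.lookup(part)
        if node is None:
            return None
    return node
```

In this sketch, calling `link` twice with different names and the same metadata object models a hard link: both names resolve to the identical object, and removing one name leaves the other intact.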

3.3.3 Large-File Metadata 

For the data content of large files on disk, Conquest 
currently maintains a dynamically allocated, doubly 
linked list of segments to keep track of disk storage 
locations. Disk storage is allocated contiguously 
whenever possible, in temporal, or LFS, order [35]. 

Although we have a linear search structure, its
simplicity and in-core speed outweigh its algorithmic
inefficiency, as we will demonstrate in the performance
evaluation (Section 5). In the worst case of severe disk
fragmentation, we will encounter a linear slowdown in
traversing the metadata. However, given that we have
coarse disk-management granularity, the segment list is
likely to be short. Also, since the search is in-core but
access is limited by disk bandwidth, we expect little
performance degradation for random accesses to large
files.
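
A minimal sketch of such a segment list follows. It assumes each segment records a starting disk block and a length in blocks, and that an append contiguous with the tail simply extends the tail; the class and method names are hypothetical and not taken from the Conquest source.

```python
class Segment:
    """One contiguous run of disk blocks belonging to a large file."""

    def __init__(self, disk_start, length):
        self.disk_start = disk_start  # first disk block of the run
        self.length = length          # number of blocks in the run
        self.prev = None
        self.next = None


class SegmentList:
    """Doubly linked list of segments, kept in file order (illustrative)."""

    def __init__(self):
        self.head = None
        self.tail = None

    def append(self, disk_start, length):
        tail = self.tail
        if tail is not None and tail.disk_start + tail.length == disk_start:
            tail.length += length  # contiguous on disk: extend the last run
            return
        seg = Segment(disk_start, length)
        seg.prev = tail
        if tail is None:
            self.head = seg
        else:
            tail.next = seg
        self.tail = seg

    def locate(self, file_block):
        """Linear search: map a block offset in the file to a disk block."""
        if file_block < 0:
            raise ValueError("negative file offset")
        base = 0
        seg = self.head
        while seg is not None:
            if file_block < base + seg.length:
                return seg.disk_start + (file_block - base)
            base += seg.length
            seg = seg.next
        raise ValueError("offset is beyond the end of the file")

    def __len__(self):
        count, seg = 0, self.head
        while seg is not None:
            count, seg = count + 1, seg.next
        return count
```

The search cost grows with the number of segments rather than the file size, so a file allocated in a few long contiguous runs is located in a few steps.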

Currently, we store the large-file data blocks 
sequentially as the write requests arrive, without regard 
to file membership. We chose this temporal order only 
for simplicity in the initial implementation. Unlike 
LFS, we keep metadata in-core, and existing file blocks 
are updated in-place as opposed to appending various 
versions of data blocks to the end of the log. Therefore, 
Conquest does not consume contiguous regions of disk 
space nearly as fast as LFS, and demands no continuous 
background disk cleaning. 

Still, our eventual goal is to apply existing 
approaches from both video-on-demand (VoD) servers 
and traditional file systems research to design the final 
layout. For example, given its sequential-access 
nature, a large media file can be striped across disk 
zones, so disk scanning can serve concurrent accesses 
more effectively [8]. Frequently accessed large files 
can be stored completely near outer zones for higher 
disk bandwidth. Spatial and temporal ordering can be
applied within each disk zone, at the granularity of an
enlarged disk block.

With a variety of options available, the
presumption is that after enlarging the disk access
granularity for large file accesses, disk transfer time
will dominate access times. Since most large files are
accessed sequentially, IO buffering and simple
predictive prefetching methods should still be able to 
deliver good read bandwidth. 


3.4 Memory Management 


Although it reuses the code of the existing memory 
manager, Conquest has its own dedicated instances of 
the manager, residing completely in Conquest, with 
each governing its own memory region. Since all
references within a Conquest memory manager are 
encapsulated within its governed region, and each 
region has its own dedicated physical address space, we 
can save and restore the runtime states of a Conquest 
memory manager directly in-core without serialization 
and deserialization. 

Conquest avoids memory fragmentation by using 
existing mechanisms built in various layers of the 
memory managers under Linux. For sub-block 
allocations, the slab allocator compacts small memory 
requests according to object types and sizes [4]. For 
block-level allocations, memory mapping assures 
virtual contiguity without external fragmentation. 

In the case of in-core storage depletion, we have 
several options. The simplest handling is to declare the 
resource depleted, which is our current approach (the 
same as is used for PDAs). However, under Conquest, 
this option implies that storage capacity is now limited 
by both memory and disk capacities. Dynamically 
adjusting the in-core storage threshold is another 
possibility, but changing the threshold can potentially 
lead to a massive migration of files. Our disk storage is 
potentially threatened with smaller-than-expected files 
and associated performance degradation. 


3.5 Reliability 


Storing data in-core inevitably raises the question of 
reliability and data integrity. At the conceptual level, 
disk storage is often less vulnerable to corruption by 
software failures because it is less likely to perform 
illegal operations through the rigid disk interface, 
unless memory-mapped. Main memory has a very 
simple interface, which allows a greater risk of 
corruption. A single wild kernel pointer could easily 
destroy many important files. However, a study
conducted at the University of Michigan has shown that
the risk of data corruption due to kernel failures is less
than one might expect. Assuming one system crash
every two months, one can expect to lose in-memory
data about once a decade [27].
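
The two figures above imply a rough per-crash risk, which the following back-of-the-envelope sketch works out. The resulting probability is only what the stated crash interval and loss interval imply together; it is not a number quoted from [27].

```python
MONTHS_PER_DECADE = 120
MONTHS_BETWEEN_CRASHES = 2  # "one system crash every two months"

# Number of crashes expected over the decade in which one loss occurs.
crashes_per_decade = MONTHS_PER_DECADE / MONTHS_BETWEEN_CRASHES

# One expected loss per decade spread over those crashes.
implied_loss_per_crash = 1 / crashes_per_decade

print(crashes_per_decade, round(implied_loss_per_crash, 3))
```

Under these assumptions, about one crash in sixty would corrupt in-memory data.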

Another threat to the reliability of an in-memory 
file system is the hardware itself. Modern disks have a 
mean time between failures (MTBF) of 1 million hours 
[37]. Two hardware components, the RAM and the 
battery backup system, cause Conquest's MTBF to be 
different from that of a disk. In our prototype, we use a 
UPS as the battery backup. The MTBF of a modern 
UPS is lower than that of disks, but is still around 
100,000 hours [14, 36]. The MTBF of the RAM is
comparable to disk [22]; however, the MTBF of 
Conquest is dominated by the characteristics of the 
complete computer system; modern machines again 
have an MTBF of over 100,000 hours. Thus, it can be 
seen that Conquest should lose data due to hardware 
failures at most once every few years. This is well 
within the range that users find acceptable in 
combination with standard backup procedures. 
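
The "once every few years" estimate can be checked with a simple series-reliability sketch. It assumes that component failures are independent with constant failure rates, that any single failure loses data, and that the RAM MTBF equals the 1-million-hour disk figure; these are modeling assumptions made here for illustration, not statements from the cited sources.

```python
HOURS_PER_YEAR = 24 * 365


def series_mtbf(*mtbf_hours):
    """MTBF of a system that fails when any one component fails."""
    # With constant, independent failure rates, the rates (1/MTBF) add.
    return 1.0 / sum(1.0 / m for m in mtbf_hours)


ups_hours = 100_000        # UPS, from the text
ram_hours = 1_000_000      # RAM, assumed equal to the disk figure
machine_hours = 100_000    # complete computer system, from the text

combined_hours = series_mtbf(ups_hours, ram_hours, machine_hours)
combined_years = combined_hours / HOURS_PER_YEAR
```

With these inputs the combined MTBF comes to a little under 48,000 hours, or between five and six years, which is consistent with the estimate above.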

At the implementation level, an extension is to use 
approaches similar to Rio [7], which allows volatile 
memory to be used as a persistent store with little 
overhead. For metadata, we rely heavily on atomic 
pointer commits. In the event of crashes, the system 
integrity can remain intact, at the cost of potential 
memory leaks (which can be cleaned by fsck) for in- 
transit memory allocations. 

In addition, we can still apply the conventional
techniques of sandboxing, access control,
checkpointing, fsck, and object-oriented
self-verification. For example, Conquest still needs to
perform system backups. Conquest uses common
memory protection mechanisms by having a dedicated 
memory address space for storage (assuming a 64-bit 
address space). A periodic fsck is still necessary, but it 
can run at memory speed. We are also exploring the 
object-store approach of having a “typed” memory area, 
so a pointer can be verified to be of a certain type
before being accessed. 

3.6 64-Bit Addressing 

Having a dedicated physical address space in which to 
run Conquest significantly reduces the 32-bit address 
space and raises the question of 64-bit addressing. 
However, our current implementation on a 32-bit 
machine demonstrates that 64-bit addressing 
implications are largely orthogonal to materializing 
Conquest, although a wide address space does offer
many future extensions (e.g., having distributed
Conquest sharing the same address space, so pointers
can be stored directly and transferred across machine
boundaries as in [6]).


4 Conquest Implementation Status


The Conquest prototype is operational as a loadable 
kernel module under Linux 2.4.2. The current
implementation follows the VFS API, but we need to
override generic file access routines at times to provide
both in-core and on-disk accesses. For example, inside
the read routine, we assume that accessing memory is
the common case, while providing a forwarding path 
for disk accesses. The in-core data path no longer 
contains code for checking the status of the buffer 
cache, faulting and prefetching pages from disk, 
flushing dirty pages to disk to make space, performing 
garbage collection, and so on. The disk data path no 
longer contains mechanisms for on-disk metadata 
chasing and various internal fragmentation and seek- 
time optimizations for small files. 
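
The shape of such a read routine can be sketched as follows. This is an illustration of the dispatch idea only: the record layout, the names, and the callable used to stand in for the disk path are hypothetical, and the real routine is kernel code behind the VFS API.

```python
class FileEntry:
    """Hypothetical per-file record; exactly one of the two fields is set."""

    def __init__(self, in_core=None, disk_read=None):
        self.in_core = in_core      # bytes-like object for in-core files
        self.disk_read = disk_read  # callable(offset, length) for disk files


def read(entry, offset, length):
    """Read `length` bytes at `offset`, treating memory as the common case."""
    if entry.in_core is not None:
        # Common case: copy straight out of memory, with no cache-status
        # checks, page faulting, or prefetching on this path.
        return bytes(entry.in_core[offset:offset + length])
    # Forwarding path: hand the request to the on-disk data path.
    return entry.disk_read(offset, length)
```

Keeping the two paths separate means the in-core branch carries none of the disk-related bookkeeping.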

Because we found it relatively difficult to alter the 
VFS to not cache metadata, we needed to pass our 
metadata structures through VFS calls such as mknod, 
unlink, and lookup. We altered the VFS, so that the 
Conquest root node and metadata are not destroyed at 
umount times. 

We modified the Linux memory manager in 
several ways. First, we introduced Conquest zones. 
With the flexibility built into the Linux zone allocator, 
it is feasible to allocate unused Conquest memory
within a zone to perform other tasks such as IO
buffering and program execution. However, we chose
to manage memory at the coarser grain of zones, to
conserve memory in a simpler way.

The Conquest memory manager is instantiated top- 
down instead of bottom-up, meaning Conquest uses 
high-level slab allocator constructs to instantiate 
dedicated Conquest slab managers, then lower-level 
zone and page managers. By using high-level 
constructs, we only need to build an instantiation 
routine, invoked at file system creation times. 

Since Conquest managers reside completely in the 
memory region they govern, runtime states (i.e.,
pointers) of Conquest managers can survive reboots 
with only code written for reconnecting several data 
structure entry points back to Conquest runtime 
managers. No pointer translation was required. 

Conquest is POSIX-compliant and supports both 
in-core and on-disk storage. We use a 1-MB static
dividing line to separate small files from large files
(Section 3.2). Large files are stored on disk in 4-KB
blocks, so that we can use the existing paging and
protection code without alterations. An optimization is
to enlarge the block size to 64 KB or 256 KB for better
performance.


5 Conquest Performance


We compared Conquest with ext2 [5], reiserfs [26], 
SGI XFS [40], and ramfs by Transmeta. We chose 
ext2, reiserfs, and SGI XFS largely because they are the
common basis for various performance comparisons.
Note that with 2 Gbytes of physical RAM, these disk- 
based file systems use caching extensively, and our 
performance numbers reflect how well these file 
systems exploit memory hardware. In the experiments, 
all file systems have the same amount of memory 
available as Conquest. 

Ramfs by Transmeta uses the page cache and v- 
nodes to store the file system content and metadata 
directly, and ramfs provides no means of achieving data 
persistence after a system reboot. Given that both 
Conquest and ramfs are under the VFS API and various 
OS legacy constraints, ramfs should approximate the 
practical achievable bound for Conquest performance. 
Our experimental platform is described in Table 5.1. 
Various file system settings are listed in Table 5.2. 


Experimental platform

Manufacturer model          Dell PowerEdge 4400
Processor                   1 GHz 32-bit Xeon Pentium
Processor bus               133 MHz
Memory                      4x512 MB, Micron MT18LSDT6472G,
                            SYNCH, 133 MHz, CL3, ECC
L2 cache                    256 KB Advanced
Disk                        73.4 GB, 10,000 RPM, Seagate ST173404LC
Disk partition for testing  6.1 GB partition starting at cylinder 7197
IO adaptor                  Adaptec AIC-7899 Ultra 160/m SCSI host
                            adaptor, BIOS v25306
UPS                         APC Smart-UPS 700
OS                          Linux 2.4.2


Table 5.1: Experimental platform.


File system settings

cfs                creation: default, mount: default
ext2fs (0.5b)      creation: default, mount: default
Transmeta ramfs    creation: default, mount: default
reiserfs (3.6.25)  creation: default, mount: -o notail
SGI XFS (1.0)      creation: -l size=32768b,
                   mount: -o logbufs=8,logbsize=32768


Table 5.2: File system settings.


We used the Sprite LFS microbenchmarks [35]. As for 
macrobenchmarks, the most widely used in the file 
system literature is the Andrew File System Benchmark
[16]. Unfortunately, this benchmark no longer stresses 
modern file systems because its data set is too small. 
Instead, we present results from the PostMark 


macrobenchmark² [18] and our modified PostMark
macrobenchmark, which is described in Section 5.3. 
All results are presented at a 90% confidence level. 


5.1 Sprite LFS Microbenchmarks 


The Sprite LFS microbenchmarks measure the latency 
and throughput of various file operations, and the 
benchmark suite consists of two separate tests for small 
and large files. 

5.1.1 Small-File Benchmark

The small-file benchmark measures the latency of file 
operations, and consists of creating, reading, and 
unlinking 10,000 1-KB files, in three separate phases.
Figure 5.1 summarizes the results.

Conquest vs. ramfs: Compared to ramfs,
Conquest incurs 5% and 13% overheads in file creation 
and deletion respectively, because Conquest maintains 
its own metadata and hashing data structures to support 
persistence, which is not provided by ramfs. Also, we 
have not removed or disabled VFS caching for 
metadata; therefore, VFS needs to go through an extra 
level of indirection to access Conquest metadata at 
times, while ramfs stores its metadata in cache. 

Nevertheless, Conquest has demonstrated a 15% 
faster read transaction rate than ramfs, even when ramfs 
is performing at near-memory bandwidth. Conquest is 
able to improve this aspect of performance because the 
critical path to the in-core data contains no generic 
disk-related code, such as readahead and checking for 
cache status. 

Conquest vs. disk-based file systems: Compared 
to ext2, Conquest demonstrates a 50% speed
improvement for creation and deletion, mostly
attributable to the lack of synchronous metadata
manipulations. Like ramfs, ext2 uses the generic disk
access routines provided by VFS, and Conquest is 19%
faster in read performance than cached ext2.

SGI XFS and reiserfs are slower
than ext2 because of both journaling overheads and
their in-memory behaviors. Reiserfs actually achieved 
poorer performance with its original default settings. 
Interestingly, reiserfs performs better with the notail 
option, which disables certain disk optimizations for 
small files and the fractional block at the end of large 
files. While the intent of these disk optimizations is to 
save extra disk accesses, their overhead outweighs the 
benefits when there is sufficient memory to buffer disk 
accesses. 


As for SGI XFS, its original default settings also
produced poorer performance, since journaling
consumes the log buffer quite rapidly. As we increased
the buffer size for logging, SGI XFS performance
improved. The numbers for both reiserfs and SGI XFS
suggest that the overhead of journaling is very high.

Figure 5.1: Transaction rate for the different phases of the Sprite LFS
small-file benchmark, run over SGI XFS, reiserfs, ext2, ramfs, and
Conquest. The benchmark creates, reads, and unlinks 10,000 1-KB
files in separate phases. In this and most subsequent figures, the 90%
confidence bars are nearly invisible due to the narrow confidence
intervals.

² As downloaded, PostMark v1.5 reported times only to a 1-second
resolution. We have altered the benchmark to report timing data at
the resolution of the system clock.


5.1.2 Large-File Benchmark

The large-file benchmark writes a large file sequentially 
(with flushing), reads from it sequentially, and then 
writes a new large file randomly (with flushing), reads 
it randomly, and finally reads it sequentially. The final 
read phase was originally designed to measure 
sequential read performance after random write 
requests were sequentially appended to the log in a log- 
structured file system. Data was flushed to disk at the 
end of each write phase. 

For Conquest on-disk files, we altered the large-file 
benchmark to perform each phase of the benchmark on 
forty 100-MB files before moving to the next phase.
Since we have a dividing line between small and large
files, we also investigated the sizes of 1 MB and 1.01
MB, with each phase of benchmark performed on ten
1-MB or 1.01-MB files. In addition, we memory-aligned
all random accesses to reflect real-world usage patterns. 

The 1-MB benchmark: The 1-MB large-file
benchmark measures the throughput of Conquest’s in-
core files (Figure 5.2a). Compared to ramfs, Conquest 
achieves 8% higher bandwidth in both random and 
sequential writes and 15% to 17% higher bandwidth in 
both random and sequential reads. It is interesting to 
observe that random memory writes and reads are faster 
than corresponding sequential accesses. This is because 
of cache hits: for 1-MB memory accesses with a 256- 
KB L2 cache size, random accesses have a roughly 
25% chance of reusing the L2 cache content. We 
believe that the difference is larger for writes because 
of a write-back, write-allocate L2 cache design, which
incurs additional overhead on sequential writes of large
amounts of data.
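
The roughly 25% figure follows from the ratio of cache size to file size. The sketch below states that arithmetic under a deliberately simple model, assumed here for illustration: accesses are uniformly random over the file, and the cache holds some arbitrary portion of it equal to the cache size.

```python
def reuse_chance(cache_bytes, working_set_bytes):
    """Chance that a uniformly random access lands on cached data."""
    # Simplified model: the cache holds cache_bytes worth of the working
    # set, so the hit chance is the covered fraction, capped at 1.
    return min(1.0, cache_bytes / working_set_bytes)


l2_cache = 256 * 1024     # 256-KB L2 cache
file_size = 1024 * 1024   # 1-MB file
chance = reuse_chance(l2_cache, file_size)
```

Here `chance` evaluates to 0.25. Real caches add associativity and replacement effects that this model ignores.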
(a) Sprite LFS large-file benchmark for 1-MB (in-core Conquest)
files.

(b) Sprite LFS large-file benchmark for 1.01-MB (on-disk Conquest)
files.

(c) Sprite LFS large-file benchmark for 100-MB (on-disk Conquest)
files. [Plot omitted; vertical axis in MB/sec.]


Figure 5.2: Bandwidth for the different phases (sequential write,
sequential read, random write, random read, sequential read) of the
Sprite LFS large-file benchmarks, run over SGI XFS, reiserfs, ext2,
ramfs, and Conquest. These two tests compare the performance of
in-core and on-disk files under Conquest.


Compared to disk-based file systems, Conquest
demonstrates a 1900% speed improvement in sequential
writes over ext2, 15% in sequential reads, 6700% in
random writes, and 18% in random reads. SGI XFS and
reiserfs perform either comparably to or slower than
ext2.

The 1.01-MB benchmark: The 1.01-MB large-file
benchmark shows the performance effects of
switching a file from memory to disk under Conquest
(Figure 5.2b). Conquest disk performance matches the
performance of cached ext2 pretty well. In our design,
in-core and on-disk data accesses use disjoint data paths
wherever possible, so Conquest imposes little or no 
extra overhead for disk accesses. 

The 100-MB benchmark: The 100-MB large-file 
benchmark measures the throughput of Conquest on- 
disk files (Figure 5.2c). We only compared against 
disk-based file systems because the total size exercised 
by the benchmark exceeds the capacity of ramfs. All 
file systems demonstrate similar performance. 
Compared to cached ext2, Conquest shows only 8% and 
4% improvements in sequential and random writes. We 
expect further performance improvements after 
enlarging the block size to 64 KB or 256 KB. 

5.2 PostMark Macrobenchmark

The PostMark benchmark was designed to model the 
workload seen by Internet service providers [18]. 
Specifically, the workload is meant to simulate a 
combination of electronic mail, netnews, and web- 
based commerce transactions. 

PostMark creates a set of files with random sizes 
within a set range. The files are then subjected to 
transactions consisting of a pairing of file creation or 
deletion with file read or append. Each pair of
transactions is chosen randomly and can be biased via 
parameter settings. The sizes of these files are chosen 
at random and are uniformly distributed over the file 
size range. A deletion operation removes a file from 
the active set. A read operation reads a randomly 
selected file in entirety. An append operation opens a 
random file, seeks to the end of the file, and writes a 
random amount of data, not exceeding the maximum 
file size. 

We initially ran our experiments using the 
configuration of 10,000 files with a size range of 512 
bytes to 16 KB. One run of this configuration performs 
200,000 transactions with equal probability of creates 
and deletes, and a four times higher probability of 
performing reads than appends. The transaction block 
size is 512 bytes. However, since this workload is far
smaller than the workload observed at any ISP today, 
we varied the total number of files from 5,000 to 30,000 
to see the effects of scaling. 

Another adjustment of the default setting is the 
assumption of a single flat directory. Since it is unusual 
to store 5,000 to 30,000 files in a single directory, we 
reconfigured PostMark to use one subdirectory level to 
distribute files uniformly, with the number of 
directories equal to the square root of the file set size.
This setting ensures that each level has the same
directory fanout.
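
The directory layout can be sketched as follows. The rounding of the square root to the nearest integer and the round-robin assignment of files to directories are assumptions made for illustration; the text does not say how PostMark was configured to handle non-square file counts.

```python
import math
from collections import Counter


def directory_count(num_files):
    """Number of subdirectories: the square root of the file set size."""
    # Assumption: round to the nearest integer for non-square counts.
    return max(1, round(math.sqrt(num_files)))


def assign_directories(num_files):
    """Spread files uniformly over the subdirectories, round-robin."""
    ndirs = directory_count(num_files)
    return [i % ndirs for i in range(num_files)]


layout = Counter(assign_directories(10_000))
```

For 10,000 files this gives 100 directories of 100 files each, so the top level and each subdirectory have the same fanout.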

Since all files within the specified size range will 
be stored in memory under Conquest, this benchmark 
does not exercise the disk aspect of the Conquest file 
system. Also, since this configuration specifies an 
average file set of only 250 MB, which fits comfortably 
in 2 GB of memory, this benchmark compares the 
memory performance of Conquest against the 
performance of existing cache and I/O buffering
mechanisms, under a realistic mix of file operations.

Figure 5.3: PostMark transaction rate for SGI XFS, reiserfs, ext2,
ramfs, and Conquest, varying from 5,000 to 30,000 files. The
results are averaged over five runs.


Figure 5.3 compares the transaction rates of
Conquest with various file systems as the number of 
files is varied from 5,000 to 30,000. First, the 
performance of Conquest differs little from ramfs 
performance. We feel comfortable with Conquest’s 
performance at this point, given that we still have room 
to reduce costs for at least sequential writes (enlarging 
the disk block size). Conquest outperforms ext2 
significantly; the performance gap widens from 24% to 
350% as the number of files increases. SGI XFS and
reiserfs perform slower than ext2 due to journaling 
overheads. 

For space reasons, we have omitted other graphs 
with similar trends—bandwidth, average creation rate, 
read rate, append rate, and average deletion rate. 

Taking a closer look at the file-creation component 
of the performance numbers, we can see that without 
interference from other types of file transactions 
(Figure 5.4a), file creation rates show little degradation 
for all systems as the number of files increases. When 
mixed with other types of file transactions (Figure 
5.4b), file creation rates degrade drastically. 

With only file creations, Conquest creates 9% 
fewer files per second than ramfs. However, when 
creations are mixed with other types of file transactions, 
Conquest creates files at a rate comparable to ramfs.

(a) PostMark file creation rate.

(b) PostMark file creation rate, mixed with other types of file
transactions.


Figure 5.4: PostMark file creation performance for SGI XFS, reiserfs,
ext2, ramfs, and Conquest, varying from 5,000 to 30,000 files. The
results are averaged over five runs.


Compared to ext2, Conquest performs at a 26% 
faster creation rate (Figure 5.4a), compared to the 50% 
faster rate in the LFS Sprite benchmark. Ext2 has a
better creation rate under PostMark because files being 
created have larger file sizes. The write buffer used by 
ext2 narrows the performance difference of file creation 
when compared to Conquest. 

Similar to the comparison between Conquest and 
ramfs, it is interesting to see that SGI XFS has a faster
file creation rate than reiserfs without mixed traffic, but 
a slower rate than reiserfs with mixed traffic. This 
result demonstrates that optimizing individual 
operations in isolation does not necessarily produce 
better performance when mixed with other operations. 

We have omitted the graphs for file deletion, since 
they show similar trends. 


5.3 Modified PostMark Benchmark

To exercise both the memory and disk components of 
Conquest, we modified the Postmark benchmark in the 
following way. We generated a percentage of files in a
large-file category, with file sizes uniformly distributed
between 2 MB and 5 MB. The remaining files were
uniformly distributed between 512 bytes and 16 KB. We
fixed the total number of files at 10,000 and varied the
percentage of large files from 0.0 to 10.0 (0 GB to 3.5 
GB). Since the file set exceeds the storage capacity of 
ramfs, we were forced to omit ramfs from our results. 
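
The quoted file set sizes can be checked with a short calculation. The sketch below assumes the large-file range is 2 MB to 5 MB (a mean of 3.5 MB, which matches the stated 3.5 GB at 10 percent) and uses the mean of each uniform range; the function name is hypothetical.

```python
def expected_file_set_bytes(n_files, large_pct,
                            small=(512, 16 * 1024),
                            large=(2 * 2**20, 5 * 2**20)):
    """Expected total size when large_pct percent of files are large."""
    n_large = round(n_files * large_pct / 100)
    n_small = n_files - n_large
    mean_small = sum(small) / 2     # mean of a uniform range
    mean_large = sum(large) / 2
    return n_large * mean_large + n_small * mean_small

for pct in (0.0, 1.0, 5.0, 10.0):
    gib = expected_file_set_bytes(10_000, pct) / 2**30
    print(f"{pct:4.1f}% large files -> {gib:.2f} GiB")
```

At 10 percent, the 1,000 large files dominate the total, and the 9,000 small files add well under 0.1 GB.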

Figure 5.6 compares the transaction rate of SGI
XFS, reiserfs, ext2, and Conquest. Figure 5.6a shows
how the measured transaction rates of the four file
systems vary as the percentage of large files increases. 
Because the scale of this graph obscures important 
detail at the right-hand side, Figure 5.6b zooms into the 
graph with an expanded vertical scale. Finally, Figure 
5.6c shows the performance ratio of Conquest over 
other disk-based file systems.

(a) The full-scale graph.

(b) The zoomed graph.

Figure 5.6: Modified PostMark transaction rate for SGI XFS, reiserfs,
ext2, and Conquest, with varying percentages of large (on-disk
Conquest) files ranging from 0.0 to 10.0 percent.

Conquest demonstrates 29% to 310% faster
transfer rates than ext2 (Figure 5.6c). The shape of the
Conquest speedup curve over ext2 reflects the rapid
degradation of ext2 performance with the injection of
disk traffic. As more disk traffic is injected, we start to
see a relatively steady performance ratio. At steady
state, Conquest shows a 75% faster transaction rate than
ext2.

Both SGI XFS and reiserfs show significantly slower
memory performance (left side of Figure 5.6a).
However, as the file set exceeds the memory size, SGI
XFS starts to outperform ext2 and reiserfs (Figure 5.6c).
Clearly, different file systems are optimized for 
different conditions. 


6 Related Work 


The database community has a long established history 
of memory-only systems. An early survey paper 
reveals key architectural implications of sufficient 
RAM and identifies several early main memory
databases [13]. The cost of main memory may be the 
primary concern that prevents operating systems from 
adopting similar solutions for practical use, and 
Conquest offers a transition for delivering file system 
services from main memory in a practical and cost- 
effective way. 

In the operating system arena, one early use of 
persistent RAM was for buffering write requests [2]. 
Since dirty data were buffered in persistent memory, 
the interval between synchronizations to the disk could 
be lengthened. 

The Rio file cache [28] combines UPS, volatile 
memory, and a modified write-back scheme to achieve 
the reliability of write-through file cache and 
performance of pure write-back file cache (with no 
reliability-induced writes to disk). The resiliency 
offered by Rio complements Conquest’s performance 
well. While Conquest uses main store as the final 
storage destination, Rio’s BIOS safe sync mechanism 
provides a high assurance of dumping Conquest 
memory to disk in the event of infrequent failures that 
require power cycles. 

Persistent RAM has been gaining acceptance as the 
primary storage medium on small mobile computing 
devices through a plethora of flash-memory-based file 
systems [29, 41]. Although this departure from disk 
storage marks a major milestone toward persistent-
RAM-based storage, flash memory has some
unpleasant characteristics, notably the limited number 
of erase-write cycles and slow (second-range) time for 
storage reclamation. These characteristics cause 
performance problems and introduce a different kind of 
operating system complexity. Our research currently
focuses on the general performance characteristics
exemplified by battery-backed DRAM (BB-DRAM).

The leading PDA operating systems, PalmOS and 
Windows CE, deliver memory and file system services 
via BB-DRAM, but both systems are more concerned 
with fitting an operating system into a memory- 
constrained environment, in contrast to the assumed 
abundance of persistent RAM under Conquest. 
PalmOS lacks a full-featured execution model, and 
efficient methods for accessing large data objects are 
limited [1]. Windows CE is unsuitable for general 
desktop-scale deployment because it tries to shrink the 
full operating system environment to the scale of a 
PDA. Also, the Windows CE architecture inherits 
many disk-related assumptions [24]. 

IBM AS/400 servers provide the appearance of 
storing all files in memory from the user’s point of 
view. This uniform view of storage access is
accomplished by the extensive use of virtual memory.
The AS/400 design is an example of how Conquest can
enable a different file system API. However, 
underneath the hood of AS/400, conventional roles of 
memory acting as the cache for disk content still apply, 
and disks are still the persistent storage medium for 
files [17]. 

One form of persistent RAM under development is 
Magnetic RAM (MRAM) [3]. An ongoing project on 
MRAM-enabled storage, HeRMES, also takes 
advantage of persistent RAM technologies [25]. 
HeRMES uses MRAM primarily to store the file 
metadata to reduce a large component of existing disk 
traffic, and also to buffer writes to lengthen the time 
frame for committing modified data. HeRMES also 
assumes that persistent RAM will remain a relatively 
scarce resource for the foreseeable future, especially for 
large file systems. 


7 Lessons Learned 


Through the design and implementation of Conquest, 
we have learned the following major lessons: 

First, the handling of disk characteristics permeates 
file system design even at levels above the device layer. 
For example, default VFS routines contain readahead 
and buffer-cache mechanisms, which add high and 
unnecessary overheads to low-latency main store. 
Because we needed to bypass these mechanisms, 
building Conquest was much more difficult than we 
initially expected. For example, certain downstream 
storage routines anticipate data structures associated 
with disk handling. We either need to find ways to 
reuse these routines with memory data structures, or 
construct memory-specific access routines from scratch. 

Second, file systems that are optimized for disk are 
not suitable for an environment where memory is
abundant. For example, reiserfs and SGI XFS do not
exploit the speed of RAM as well as we anticipated. 
Disk-related optimizations impose high overheads for 
in-memory accesses. 

Third, matching the physical characteristics of 
media to storage objects provides opportunities for 
faster performance and considerable simplification for 
each medium-specific data path. Conquest applies this 
principle of specialization: leaving only the data 
content of large files on disk leads to simpler and 
cleaner management for both memory and disk storage. 
This observation may seem obvious, but results are not
automatic. For example, if the cache footprint of two 
specialized data paths exceeds the size of a single 
generic data path, the resulting performance can go in 
either direction, depending on the size of the physical 
cache. 

Fourth, access to cached data in traditional file 
systems incurs performance costs due to commingled 
disk-related code. Removing disk-related complexity 
for in-core storage under Conquest therefore yields 
unexpected benefits even for cache accesses. In 
particular, we were surprised to see Conquest
outperform ramfs by 15% in read bandwidth, knowing 
that storage data paths are already heavily optimized. 

Finally, it is much more difficult to use RAM to 
improve disk performance than it might appear at first. 
Simple approaches such as increasing the buffer cache
size or installing simple RAM-disk drivers do not 
generate a full-featured, high-performance solution. 

The overall lesson that can be drawn is that 
seemingly simple changes can have much more far-
reaching effects than first anticipated. The
modifications may be more difficult than expected, but 
the benefits can also be far greater. 


8 Future Work 


Conquest is now operational, but we can further
improve its performance and usability in a number of
ways. A few previously mentioned areas are designing
mechanisms for adjusting file size threshold
dynamically (Section 3.4) and finding a better disk 
layout for large data blocks (Section 3.3.3). 

High-speed in-core storage also opens up 
additional possibilities for operating systems. Conquest 
provides a simple and efficient way for kernel-level
code to access a general storage service, which is 
conventionally either avoided entirely or achieved 
through the use of more limited buffering mechanisms. 
One major area of application for this capability would 
be system monitoring and lightweight logging, but there 
are numerous other possibilities. 

In terms of research, so far we have aggressively 
removed many disk-related complexities from the in-
core critical path without questioning exactly how
much each disk optimization adversely affects file
system performance. One area of research is to break
down these performance costs, so designers can 
improve the memory performance for disk-based file 
systems. 

Memory under Conquest is a shared resource 
among execution, storage, and buffering for disk 
access. Finding the “sweet spot” for optimal system 
performance will require both modeling and empirical 
investigation. In addition, after reducing the roles of 
disk storage, Conquest exhibits different system-wide 
performance characteristics, and the implications can be 
subtle. For example, the conventional wisdom of
mixing CPU- and I/O-bound jobs may no longer be a
suitable scheduling policy. We are currently
experimenting with a wider variation of workloads to 
investigate a fuller range of Conquest behavior. 


9 Conclusion 


We have presented Conquest, a fully operational file 
system that integrates persistent RAM with disk storage 
to provide significantly improved performance 
compared to other approaches such as RAM disks or 
enlarged buffer caches. With the involvement of both 
memory and disk components, we measure a 43% to 
96% speedup compared to popular disk-based file 
systems. 

During the development of Conquest, we 
discovered a number of unexpected results. Obvious ad
hoc approaches not only fail to provide a complete 
solution, but perform more poorly than Conquest due to 
the unexpectedly high cost of going through the buffer 
cache and disk-specific code. We found that it was 
very difficult to remove the disk-based assumptions 
integrated into operating systems, a task that was 
necessary to allow Conquest to achieve its goals. 

The benefits of Conquest arose from rethinking
basic file system design assumptions. This success
suggests that the radical changes in hardware,
applications, and user expectations of the past decade 
should also lead us to rethink other aspects of operating 
system design.


Abstract


The buffer-cache replacement policy of the OS can 
have a significant impact on the performance of I/O- 
intensive applications. In this paper, we introduce a 
simple fingerprinting tool, Dust, which uncovers the re-
placement policy of the OS. Specifically, we are able to
identify how initial access order, recency of access, fre- 
quency of access, and long-term history are used to de- 
termine which blocks are replaced from the buffer cache. 
We show that our fingerprinting tool can identify popu-
lar replacement policies described in the literature (e.g.,
FIFO, LRU, LFU, Clock, Random, Segmented FIFO,
2Q, and LRU-K) as well as those found in current sys-
tems (e.g., NetBSD, Linux, and Solaris).

We demonstrate the usefulness of fingerprinting the
cache replacement policy by modifying a web server to 
use this knowledge; specifically, the web server infers 
the contents of the OS file cache by modeling the re- 
placement policy under the given set of page requests. 
We show that by first servicing those web pages that are 
believed to be resident in the OS buffer cache, we can 
improve both average response time and throughput.


1 Introduction 


Although the specific algorithms used to manage the 
buffer cache can significantly impact the performance of 
I/O-intensive applications [8, 13, 27], this knowledge is 
usually hidden from user processes. Currently, to de- 
termine the behavior of the buffer cache, implementors 
are forced to rely on available documentation, access to 
source code, or general knowledge of how buffer caches 
behave. 

Rather than relying on these ad hoc methods, we pro-
pose the use of fingerprinting to automatically uncover
characteristics of the OS buffer cache. In this paper, we
describe Dust, a simple fingerprinting tool that is able 
to identify the buffer-cache replacement policy; specif- 
ically, we identify whether it uses initial access order, 
recency of access, frequency of access, or historical in- 
formation. 

Fingerprinting can be described as the use of micro- 
benchmarking techniques to identify the algorithms and 
policies used by the system under test. The idea behind
fingerprinting is to insert probes into the underlying sys-
tem and to observe the resulting behavior through visible
outputs. By carefully controlling the probes and match- 
ing the resulting output to the fingerprints of known al- 
gorithms, one can often identify the algorithm of the sys- 
tem under test. The key challenge is to inject probes 
to create distinctive fingerprints such that different algo- 
rithmic characteristics can be isolated. 

There are several significant advantages to using fin- 
gerprints for automatically identifying internal algo- 
rithms. First, fingerprinting eliminates the need for a 
developer to obtain documentation or source code to un- 
derstand the underlying system. Second, fingerprinting 
enables all programmers, not just those with sophisti- 
cated experience, to use algorithmic knowledge and thus 
improve performance. Third, fingerprinting can uncover 
bugs, or hidden complexities, in systems either under de- 
velopment or already deployed. Finally, fingerprinting 
can be used at run-time, allowing an adaptive application
to modify its own behavior based on the characteristics 
of the underlying system. 

In this paper, we investigate a new use of algorith- 
mic knowledge: its use in exposing the current con- 
tents of the OS buffer cache. Recent work has shown 
that I/O-intensive applications can improve their perfor- 
mance given information about the contents of the file 
cache [3, 33]; specifically, applications that can handle 
data from disk in a flexible order should first access those
blocks in the buffer cache and then those on disk. How- 
ever, current approaches suffer from one of two limita- 
tions: they either require changes to the underlying OS 
to export this information or cannot accurately identify 
the presence of small files in the buffer cache. 

We observe that an application can model (or simu- 
late) the state of the buffer cache if it knows the replace- 
ment policy used by the OS and can see most file ac- 
cesses. A dedicated web server can greatly benefit from 
knowing the contents of the buffer cache and servic- 
ing first those requests that will hit in the buffer cache. 
We have implemented a cache-aware web server based 
on the NeST storage appliance [6] and show that this 
web server improves both average response time and 
throughput.
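
The idea of modeling the buffer cache and servicing likely hits first can be sketched as follows. This is an illustration under simplifying assumptions, not the NeST-based server: it assumes a pure LRU policy, treats each page as a single cache entry, and uses the hypothetical names `LRUModel` and `order_requests`.

```python
from collections import OrderedDict

class LRUModel:
    """User-level model of an OS buffer cache, assuming pure LRU replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()       # oldest entry first

    def access(self, page):
        """Record an access and return True if the model predicts a hit."""
        if page in self.pages:
            self.pages.move_to_end(page)
            return True
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)   # evict the least recently used
        self.pages[page] = True
        return False

    def contains(self, page):
        return page in self.pages

def order_requests(pending, model):
    """Return pending requests with predicted cache hits first."""
    hits = [p for p in pending if model.contains(p)]
    misses = [p for p in pending if not model.contains(p)]
    return hits + misses

model = LRUModel(capacity=2)
for page in ["a", "b", "c"]:             # "a" is evicted when "c" arrives
    model.access(page)
print(order_requests(["a", "b", "c"], model))   # ['b', 'c', 'a']
```

A server using such a model would call `access` for every file it serves so that the model tracks the real cache as closely as possible.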

In this paper we make the following contributions:

• We introduce Dust, a fingerprinting tool that au-
tomatically identifies cache replacement policies
based upon how they prioritize between initial ac-
cess order, recency of access, frequency of access,
and historical information.

• We demonstrate through simulations that Dust can
distinguish between a variety of replacement poli-
cies found in the literature: FIFO, LRU, LFU, Ran-
dom, Clock, Segmented FIFO, 2Q, and LRU-K.

• We use our fingerprinting software to identify the
replacement policies used in several operating sys-
tems: NetBSD 1.5, Linux 2.2.19 and 2.4.14, and
Solaris 2.7.

• We show that by knowing the OS replacement pol-
icy, a cache-aware web server can first service those
requests that can be satisfied within the OS buffer
cache and thereby obtain substantial performance
improvements.


The rest of this paper is organized as follows. We 
begin in Section 2 by describing our fingerprinting ap- 
proach. In Section 3 we show via simulation that we 
can identify a range of popular replacement policies. 
In Section 4 we identify the replacement policies used 
in several current operating systems. In Section 5 we 
show how a web server can exploit knowledge of the 
buffer-cache replacement policy for improved perfor- 
mance. We briefly discuss related work in Section 6, 
and conclude in Section 7. 


2 Fingerprinting Methodology 


We now describe Dust, our software for identifying 
the page replacement policy employed by an operating 
system. By manipulating how blocks are accessed, forc- 
ing evictions, and then observing which blocks are re- 
placed, Dust can identify the parameters used by the 
page replacement policy and the corresponding algo- 
rithm. 

Dust relies upon probes to infer the current state of 
the buffer cache. By measuring the time to read a byte 
within a file block, one can determine whether or not that
block was previously in the buffer cache. Intuitively, if 
the probe is “slow”, one infers that the block was previ- 
ously on disk; if the probe is “fast”, then one infers that 
the block was already in the cache. 
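
A timing probe of this kind might look like the following sketch. It is illustrative only: the function names, the 4 KB block size, and the 0.5 ms threshold are assumptions rather than values taken from Dust, and `os.pread` is available only on Unix-like systems.

```python
import os
import time

def probe(fd, block, block_size=4096):
    """Return the time in seconds to read one byte from the given block."""
    start = time.perf_counter()
    os.pread(fd, 1, block * block_size)   # read 1 byte at the block's offset
    return time.perf_counter() - start

def looks_cached(elapsed, threshold=0.0005):
    """Classify a probe as 'fast' (likely cached) or 'slow' (likely on disk).

    The threshold is an assumed value; in practice it would be calibrated
    on the machine under test.
    """
    return elapsed < threshold
```

A caller would open the test file with `os.open(path, os.O_RDONLY)`, call `probe` on the blocks of interest, and pass each result to `looks_cached`.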

For Dust to correctly distinguish between different re-
placement policies, we must first identify the file block
attributes used by existing policies to select a victim 
block for replacement. From a search of the OS and 
database research literature and the documentation of 
existing operating systems, we have identified four at-
tributes that are often used for replacement: the order
of initial access to the block (e.g., FIFO), the recency
of accesses (e.g., LRU), the frequency of accesses (e.g., 
LFU) and historical accesses to blocks (e.g., 2Q [12]). 
Thus, we can correctly identify the use of combinations 
of these four attributes within a replacement policy. 

We note that some operating systems use replacement 
policies that consider attributes beyond what Dust con- 
siders. For example, some replacement policies consider 
whether or not pages are dirty [16], the size of the file 
the page is from, or replacement cost [10]. Further, re-
placement of pages can be performed on either a global
or per process basis [14]. Finally, in real systems, not 
only are file pages cached, but file meta-data as well, 
and some systems prefer to evict pages from files whose 
meta-data is no longer cached. It is also possible that fu-
ture replacement policies may utilize new attributes that
we do not currently fingerprint. Although Dust cannot
currently identify these parameters, we believe that the 
basic framework within Dust can be extended to do so. 

Given our goal of identifying replacement policies, 
there are three primary components to Dust. First, the 
size of the buffer cache is measured with a simple mi-
crobenchmark; this value is used as input to the remain-
ing steps. Second, the short-term replacement algorithm
is fingerprinted, based upon initial access, recency of ac- 
cess, and frequency of access. Third, Dust determines 
whether or not long-term history is used by the replace-
ment algorithm.


2.1 Microbenchmarking Buffer Cache Size 


To manipulate the state of the buffer cache and inter- 
pret its contents, Dust must first know the size of the 
buffer cache. Since this information is not readily avail- 
able through a common interface on most systems, Dust 
contains a simple microbenchmark. Dust accesses pro- 
gressively larger amounts of file data until it notices that 
some blocks no longer fit the cache. For each increase in 
the tested size, there are two steps. In the first step, Dust 
touches the file blocks up through the newly increased 
size to fetch them into the buffer cache. In the second 
step, Dust probes each block again, measuring the time 
per probe to verify if the block is still in the cache. This 
technique is similar to the technique used to determine 
available memory in NOW-Sort [4]. 
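
The two-step procedure can be sketched against a simulated cache. This is an illustration under stated assumptions, not Dust's implementation: `SimulatedLRUCache` stands in for the OS buffer cache, its `read` method reports a hit directly where the real tool would infer one from probe timing, and the step size is arbitrary.

```python
from collections import OrderedDict

class SimulatedLRUCache:
    """Stand-in for the OS buffer cache (LRU is assumed for the demo)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def read(self, block):
        """Access a block; return True on a hit, False on a miss."""
        if block in self.blocks:
            self.blocks.move_to_end(block)
            return True
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)
        self.blocks[block] = None
        return False

def estimate_cache_size(read, max_blocks, step):
    """Return the largest tested size for which every probe was a hit."""
    last_fit = 0
    for size in range(step, max_blocks + 1, step):
        for b in range(size):                       # step 1: touch all blocks
            read(b)
        hits = sum(read(b) for b in range(size))    # step 2: probe every block
        if hits < size:                             # some block no longer fits
            return last_fit
        last_fit = size
    return last_fit

cache = SimulatedLRUCache(capacity=1000)
print(estimate_cache_size(cache.read, max_blocks=5000, step=100))   # 1000
```

Because every block is probed in the second step, any miss signals that the tested size exceeds the cache, whichever block the policy chose to evict. The estimate is accurate only to within one step.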

There are two important features of this approach.
First, by probing every file block in the second step,
this algorithm is independent of the replacement policy
used to manage the buffer cache. Second, this algorithm
works even when the buffer cache is integrated with the
virtual memory system, assuming that Dust uses little
memory and the buffer cache is able to grow to its max-
imum size. Further, as we will show, our fingerprinting
algorithm is robust to slight inaccuracies in our estima-
tion of the buffer cache size.

Figure 1: Short-Term Attributes of Blocks. The three graphs show the priority of each block within the test region according to
the three metrics: order of initial access, recency of access, and frequency of access. The x-axis indicates the block number within
the file forming the test region. The y-axes indicate the initial access order (left), recency of access (center), and frequency of
access (right).


2.2 Fingerprinting Replacement Attributes 


Once the buffer cache size is known, Dust determines
the attributes of file blocks that are used by the OS short- 
term replacement policy. This fingerprinting stage in- 
volves three simple steps. First, Dust reads file blocks 
into the buffer cache while simultaneously controlling 
the replacement attributes of each block (e.g., by ac- 
cessing blocks in different initial access, recency, and 
frequency orders). Second, Dust forces some of these 
blocks to be evicted from the buffer cache by accessing 
additional file data. Finally, the contents of the buffer
cache are inferred by probing random sets of blocks; the
cache state of these file blocks is then plotted to illustrate
the replacement policy. We now describe each of these 
three steps in detail. 


2.2.1 Configuring Attributes 


The first step moves the buffer cache into a known and 
well-controlled state — both the data blocks that are res- 
ident and the initial access, recency, and frequency at- 
tributes of each resident block. This control is imposed 
by performing a pattern of reads over blocks within a 
single file; we refer to these blocks as the test region. 
To ensure that all of this data is resident, the size of this 
test region is set slightly smaller than the estimate of the
buffer cache size (precisely, we use only 90% of the es-
timated cache size and adjust the size such that each of
the ten stripes discussed below is page aligned).

Controlling the initial access parameter of each block 
allows Dust to identify replacement policies that are 
based on the initial access order of blocks (e.g., FIFO). 
To exert this control, our access pattern begins with a 
sequential scan of the test region. The resulting initial 
access queue ordering is shown in the first graph of Fig- 
ure 1; specifically, the blocks at the end of the file are 
those that are given priority (i.e., remain in the buffer 
cache) given a FIFO-based policy. 

Dust is able to identify replacement policies that are
based on temporal locality (e.g., LRU) by controlling
how recently each block is accessed and ensuring that
this ordering does not match the initial access order-
ing. To meet this criterion, a pattern of reads across ten
stripes within the file is performed. Specifically, two
indices into the file are maintained: a left pointer, which
starts at the beginning of the file, and a right pointer,
which starts at the center of the test region. The work- 
load alternates between reading one stripe as indicated 
by the left pointer and then one stripe as indicated by 
the right pointer. The pattern continues until the left 
pointer reaches the center of the test region and the right 
pointer reaches the end. This controlled pattern of ac- 
cess induces the recency queue order shown in the mid- 
dle graph of Figure 1; specifically, the blocks at the end 
of the left and right regions are those given priority with 
an LRU-based policy. 
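
The alternating left/right pattern can be written out for ten stripes. The sketch below is a simplified illustration with hypothetical function names: each stripe is read once, and the repeated reads used for frequency are ignored. It contrasts the stripes a FIFO-like policy would keep (the last ones in the initial sequential scan) with those an LRU-like policy would keep (the last ones in the recency pattern).

```python
def recency_order(n_stripes=10):
    """Stripe read order: alternate between a left and a right pointer."""
    half = n_stripes // 2
    order = []
    for i in range(half):
        order.append(i)            # left pointer: starts at the beginning
        order.append(half + i)     # right pointer: starts at the center
    return order

def survivors(order, keep):
    """Stripes retained if the `keep` most recently queued stripes survive."""
    return sorted(order[-keep:])

initial = list(range(10))          # sequential scan sets initial access order
recency = recency_order(10)

print(recency)                     # [0, 5, 1, 6, 2, 7, 3, 8, 4, 9]
print(survivors(initial, 4))       # FIFO-like: [6, 7, 8, 9]
print(survivors(recency, 4))       # LRU-like:  [3, 4, 8, 9]
```

The two survivor sets differ, which is what lets the fingerprint separate initial-access ordering from recency ordering.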


Finally, to identify policies that have a frequency-
based component, Dust ensures that stripes in the test
region have distinctive frequency counts. When read-
ing stripes for recency ordering, Dust touches each stripe
multiple times for a frequency ordering as well. In our 
pattern, stripes near the center of the test region are read 
the most often, and those near the beginning and end of 
the test region are read the least. The number of reads 
for each area of the test region is shown in the right-most 
graph of Figure 1, where blocks in the middle are given 
priority with an LFU-based policy. 
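
A frequency profile with this shape can be illustrated as follows. The exact read counts are not given in the text, so the triangular profile below is purely hypothetical, as are the function names; it only shows how a center-heavy profile leads an LFU-like policy to retain the middle stripes.

```python
def frequency_profile(n_stripes=10):
    """Hypothetical read counts: lowest at both ends, highest in the middle."""
    return [1 + min(i, n_stripes - 1 - i) for i in range(n_stripes)]

def lfu_survivors(counts, keep):
    """Stripes retained if the `keep` most frequently read stripes survive."""
    ranked = sorted(range(len(counts)), key=lambda s: (-counts[s], s))
    return sorted(ranked[:keep])

counts = frequency_profile(10)
print(counts)                      # [1, 2, 3, 4, 5, 5, 4, 3, 2, 1]
print(lfu_survivors(counts, 4))    # [3, 4, 5, 6]
```

The retained set (middle stripes) differs from both the FIFO-like set (end of the file) and the LRU-like set (ends of the left and right halves), so each attribute leaves a distinct signature.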


The need to impose different frequencies on differ- 
ent parts of the file is part of the motivation for divid- 
ing the test region into a fixed number of stripes. If, for 
instance, each block of the test region were given a dif- 
ferent frequency count, the runtime of Dust would be 
exponential in the size of the file. In our simulation ex- 
periments, we determined ten to be a good number. The 
more stripes used, the more precise the fingerprint be- 
comes since there is a greater variety of frequency and 
recency regimes. However, a greater number of stripes 
makes each stripe smaller, thus making the data more
susceptible to noise.


General Track: 2002 USENIX Annual Technical Conference


[Figure 1 panels: Initial Access Order, Access Recency, Access Frequency; each x-axis spans blocks 0 to 20000.]


2.2.2 Forcing Evictions 


Once the state of the buffer cache is configured, Dust 
performs an eviction scan in which more file data is read 
to cause some portion of the test region to be evicted 
from the cache. Since the goal of evicting pages is to 
give us the most information and ability to differentiate 
across replacement policies, Dust tries to evict approxi-
mately half of the cached data.¹

We note that the eviction scan must read each page 
multiple times such that the frequency counts of its 
pages are higher than those of the pages in the test re-
gion. Otherwise, Dust is not able to identify frequency-
based replacement policies since the eviction region
would replace its own pages. This illustrates one of 
the limitations of our approach: we do not differentiate 
between LIFO, MRU, and MFU replacement policies, 
since all replace the eviction region with itself. How- 
ever, we feel that this limitation is acceptable, given that 
such policies are used when streaming through large files 
and all tend to behave similarly under such conditions. 
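
The size of the eviction scan follows from simple arithmetic: the scan must first fill the cache space not occupied by the test region and then displace half a cache's worth of data. The sketch below works in blocks and uses a hypothetical function name; the 90% test region comes from the text.

```python
def eviction_scan_blocks(cache_blocks, test_percent=90):
    """Return (test region size, eviction scan size), both in blocks."""
    test_blocks = cache_blocks * test_percent // 100
    free_blocks = cache_blocks - test_blocks        # space left by test region
    scan_blocks = free_blocks + cache_blocks // 2   # fill it, then evict half
    return test_blocks, scan_blocks

test_blocks, scan_blocks = eviction_scan_blocks(20000)
print(test_blocks, scan_blocks)    # 18000 12000
```

For a 20,000-block cache, 2,000 blocks of the scan fit in free space and the remaining 10,000 displace test-region blocks, leaving 8,000 of the original 18,000 resident.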


2.2.3 Probing File-Buffer Contents 


To determine the state of the buffer cache after the evic- 
tion scan, we perform several probes, measuring the time 
to read one byte from selected pages. If the read call re-
turns quickly, we assume the block of the file was resi-
dent in the cache; if the read returns slowly, we assume
that a disk access was required. As noted elsewhere [3], 
it is not possible to perform a probe of every block to de- 
termine its state since this changes the state of the buffer 
cache; specifically, if Dust probes a block that was on 
disk, then this block will replace a block previously in 
the buffer cache, changing its state. Thus, we perform 
probes selectively. 

To obtain an appropriate number of samples, we probe 
each stripe two times, for a total of twenty probes. The 
probes are spaced evenly across the test region, but the 
location of the first is chosen randomly from the first 
half of the first stripe. By keeping the probes relatively 
far apart, we ensure that they do not interfere with a 
later probe due to prefetching. Choosing a random offset 
for the probes allows one to run the benchmark multiple 
times to generate a better picture of the cache state. By 
running Dust multiple times on a platform, one is then 
able to accurately determine how the cache replacement 
policy chooses victim pages based on initial access, re- 
cency of access, and frequency of access. 
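The probe placement described above can be sketched as follows. The function and parameter names are hypothetical; the sketch assumes ten equal stripes (twenty probes at two per stripe) and draws the first offset from the first half of the first stripe.

```python
import random


def probe_offsets(region_bytes, stripes=10, probes_per_stripe=2, rng=None):
    """Return evenly spaced probe offsets within the test region."""
    rng = rng or random.Random()
    n = stripes * probes_per_stripe
    spacing = region_bytes // n
    half_stripe = region_bytes // stripes // 2
    first = rng.randrange(half_stripe)    # random start in first half-stripe
    return [first + i * spacing for i in range(n)]


offsets = probe_offsets(200 * 4096, rng=random.Random(0))
print(len(offsets), offsets[1] - offsets[0])
```

Repeating the run with a different random start shifts every probe by the same amount, so successive runs sample different pages of each stripe.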


¹ Precisely, the size of the eviction scan is set equal to the
difference between the size of the cache and the size of the test region (i.e.,
0.1*cache size) plus one half the size of the cache. 


General Track: 2002 USENIX Annual Technical Conference 







Figure 2: Access Pattern to Fingerprint History. Four distinct
regions of file blocks (i.e., hot, cold, evict1, and evict2)
are accessed to set attributes and cause evictions in order to
identify whether or not history is being used by the replacement
algorithm. Each arrow indicates a region that is being
accessed; reads later in time move down the page. The width
of each arrow, along with a number, shows the number of times
each block is read to set the frequency attributes.


2.3 Fingerprinting History 


The fingerprinting tool described thus far can identify 
replacement policies containing a single queue ranking 
blocks based upon the three attributes. However, the 
previous step controls only the short-term attributes of 
blocks and thus cannot identify algorithms that track
references to blocks that are no longer in memory (e.g.,
2Q [12]) or that track the recency of references other 
than the last reference to each block (e.g., LRU-K [19]). 
To determine if long-term tracking is performed, Dust 
observes if preference is given to pages that have been 
referenced and then evicted before. 

We now describe how the use of long-term history is 
identified. As shown in Figure 2, there are four regions 
of file blocks that are now accessed. The test region is 
now divided into two separate regions that are one half 
the total cache size, a hot and a cold portion. The algorithm
begins by touching all of the hot pages and then
evicting them by twice touching the evict1 region; the
evict1 region contains sufficient blocks to entirely fill
the buffer cache. Thus, the hot pages are no longer in
the cache, but historical information about them is now
tracked. Dust then touches the hot and cold regions three
times and then touches cold two more times. At this
point, evict1 has been evicted entirely and cold is preferred
whether initial access, recency, or frequency attributes
are being used by the replacement policy. Then
cold is touched twice. This causes the cold region to
be preferred by traditional LRU and LFU. Hot is then
retouched; this additional reference gives the hot region
preference in policies that use history. The last step
prior to eviction is to rereference both the hot and cold
regions sequentially. Notice that at this point the hot region
has been touched the same number of times as the
cold region, but it has been touched in such a way that it
will have migrated into the long-term queue of a 2Q or
LRU-2 cache, while the cold region will not have.

Figure 3: Fingerprints of Basic Replacement Policies (FIFO, LRU, LFU). The three graphs show the time required to probe
blocks within the test region of a file depending upon the buffer cache replacement policy. The x-axis shows the offset of the probed
block. The y-axis shows the time required for that probe; low times (2 µs) indicate the block was in cache, whereas high
times (7 ms) indicate the block was not in cache. From left to right, the graphs simulate FIFO, LRU, and LFU.


As in the short-term fingerprint, the next phase of Dust
is to probe the test region to determine which blocks
have been kept in the file cache. If the hot region remains
in the cache, then we infer that history is being
used. If the cold region remains in the cache, then we
infer that history is not being used. Given that further
identification of history attributes is likely to be specific
to each replacement algorithm, we focus on only this
simple historical fingerprint.
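The inference rule for the historical fingerprint can be written down as a small decision function. This is a sketch under stated assumptions: the function name, the boolean probe lists, and the 0.25 margin used to call a result inconclusive are all hypothetical and not taken from Dust.

```python
def infer_history(hot_cached, cold_cached, margin=0.25):
    """Compare the cached fractions of the hot and cold regions.

    hot_cached / cold_cached: lists of booleans, one per probe,
    True when the probed page appeared to be in cache.
    margin: assumed minimum difference needed to draw a conclusion.
    """
    hot = sum(hot_cached) / len(hot_cached)
    cold = sum(cold_cached) / len(cold_cached)
    if hot - cold > margin:
        return "history"          # hot region survived: history is used
    if cold - hot > margin:
        return "no history"       # cold region survived: no history
    return "inconclusive"


print(infer_history([True] * 10, [False] * 10))
```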


3 Simulation Fingerprints 


To illustrate the ability of Dust to accurately finger- 
print a variety of cache replacement policies, we have 
implemented a simple buffer cache simulator. In this 
section, we describe our simulation framework and then 
present a number of results. Our first simulation re- 
sults verify the distinctive short-term replacement fin- 
gerprints produced for the pure replacement policies of 
FIFO, LRU, and LFU [23], as well as for other simple 
replacement policies such as Random and Segmented 
FIFO [31]. To explore the impact of internal state within 
the replacement policy, we investigate Clock [18] and 
Two-handed Clock [32]. We then demonstrate our abil- 
ity to identify the use of historical information in the re- 
placement policy, focusing on 2Q [12] and LRU-K [19]. 
We conclude this section by showing that Dust is robust 
to some inaccuracy in its estimate of buffer-cache size. 


3.1 Simulation methodology 


Given that our simulator is meant only to illustrate the 
ability of Dust to identify different OS buffer cache re- 
placement policies, we keep the rest of the system as 
simple as possible. Specifically, we assume that the only 
process running is our fingerprinting software, and thus 
ignore irregularities due to scheduling interference. We 
currently model only a buffer cache of a fixed size and 
do not consider any contention with the virtual memory 
system. For most of our simulations, we model a buffer 
cache containing approximately 80 MB (or 20,000 4 KB
pages). Finally, we assume that reads that hit in the file
cache require a constant time of 2 µs, whereas reads that
must go to disk require 7 ms.
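A simulator of this kind can be very small. The sketch below models a fixed-size cache with LRU replacement and the two constant read costs stated above; the class and constant names are hypothetical, and LRU is chosen here only as one example policy.

```python
from collections import OrderedDict

HIT_S = 2e-6     # simulated cost of a read that hits in the cache (2 µs)
MISS_S = 7e-3    # simulated cost of a read that goes to disk (7 ms)


class LRUCacheSim:
    """Fixed-size page cache with LRU replacement (illustrative only)."""

    def __init__(self, capacity_pages):
        self.capacity = capacity_pages
        self.pages = OrderedDict()            # least recently used first

    def read(self, page):
        """Read one page and return the simulated access time."""
        if page in self.pages:
            self.pages.move_to_end(page)      # mark as most recently used
            return HIT_S
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)    # evict least recently used
        self.pages[page] = True
        return MISS_S


sim = LRUCacheSim(20_000)
print(sim.read(0), sim.read(0))
```

Swapping in another policy only requires changing how `read` orders and evicts pages; the probe times stay at the same two levels.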


3.2 Basic Replacement Policies 


We begin by showing that the simulation results for 
strict FIFO, LRU, and LFU replacement policies
precisely match what one can derive from the ordering
graphs shown in Figure 1. The fingerprints from these
three simulations are shown in Figure 3. We further 
show that Dust can identify Random replacement and 
Segmented FIFO [14]. These fingerprints are shown in 
Figure 4. Across all the graphs, one can observe the two 
levels of probe times, corresponding to blocks that are in 
cache and those that are not. Also, one can verify that 
approximately half of the test data remains in cache. 

We now examine these basic policies in turn. The 
FIFO fingerprint shows that the second half of the test 
region remains in cache; this matches the initial access 
ordering shown in Figure 1, where blocks at the end of
the file have priority. The LRU fingerprint shows that
roughly the second quarter and the fourth quarter of the
test region remain in the buffer cache; once again, this
is the expected behavior since those blocks have been
accessed the most recently. Finally, the LFU fingerprint
shows that the middle half of the file remains resident, as
expected, since those blocks have the highest frequency 
counts. In the LFU fingerprint, one can see two small 
discontinuous regions that remain in cache to the left 
and right of the main in-cache area; this behavior is due 
to the fact that within each stripe, blocks have the same 
frequency count and these in-cache regions are part of a 
stripe that was beginning to be evicted. 

Fingerprinting a Random replacement policy stresses 
the importance of running Dust multiple times. With a 
single fingerprint run of twenty probes, there exists some
probability that Random replacement behaves identi- 
cally to FIFO, LRU, or LFU. Therefore, by fingerprint- 
ing the system many times, we can definitively see that 
random pages are selected for replacement. This is illustrated
in the first graph of Figure 4 with two horizontal
lines indicating the “fast” and “slow” access times.


Figure 4: Fingerprints of Random and Segmented FIFO. The left-most graph shows that a Random page replacement policy
has a distinctive fingerprint in that each run of the fingerprint causes different pages to be evicted from the buffer cache. The middle
graph shows Segmented FIFO with 30% of the buffer cache devoted to the secondary queue; the resulting fingerprint is a cyclic
shift of the FIFO fingerprint. The right-most graph shows Segmented FIFO with at least 50% of the buffer cache devoted to the
secondary queue; since this queue is managed with LRU, the fingerprint is identical to LRU.


The original VMS system implemented the Seg- 
mented FIFO (SFIFO) page replacement policy [14]. 
SFIFO divides the buffer cache into two queues. The pri- 
mary queue is managed by FIFO. Non-resident pages are 
faulted into the primary queue. When a page is evicted 
from the primary queue, it is moved to the secondary 
queue. If a page is accessed while in the secondary 
queue, it moves back into the primary queue. The key 
parameter in SFIFO is the fraction of the buffer cache 
devoted to the secondary queue, denoted P (thus, 1 - P
is the fraction devoted to the primary queue). 
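The queue movements just described can be sketched as follows. This is an illustrative reading of the description, not the VMS implementation; the class and attribute names are hypothetical, and the secondary queue is assumed to discard its oldest entry when full.

```python
from collections import OrderedDict


class SegmentedFIFO:
    """Two-queue SFIFO sketch; p is the fraction given to the secondary queue."""

    def __init__(self, capacity, p=0.3):
        self.secondary_cap = int(capacity * p)
        self.primary_cap = capacity - self.secondary_cap
        self.primary = OrderedDict()      # oldest entry first (FIFO)
        self.secondary = OrderedDict()    # oldest entry first

    def access(self, page):
        """Return True on a hit, False on a miss."""
        if page in self.primary:
            return True                   # FIFO: a hit does not reorder
        hit = page in self.secondary
        if hit:
            del self.secondary[page]      # rescued from the secondary queue
        self._insert_primary(page)
        return hit

    def _insert_primary(self, page):
        if len(self.primary) >= self.primary_cap:
            old, _ = self.primary.popitem(last=False)
            self.secondary[old] = True    # demote the oldest primary page
            if len(self.secondary) > self.secondary_cap:
                self.secondary.popitem(last=False)   # leaves the cache
        self.primary[page] = True
```

Because pages leave the secondary queue early only when they are re-referenced, discarding its oldest entry behaves like LRU within that queue.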


A value of P = 0.3 is the traditional choice and is 
fingerprinted in the middle graph of Figure 4. The re- 
sulting SFIFO fingerprint is a cyclic shift of the pure 
FIFO fingerprint. The reason for this pattern is as fol- 
lows. The initial read of the test area sets the contents 
of the primary and secondary queues such that the first 
pages accessed (i.e., the left portion of the test area) are 
shifted down to the secondary queue and the tail of the 
primary queue; the right portion is at the head of the 
primary queue. When the pages are touched to set the 
recency and frequency attributes, the left portion of the 
test area is moved back to the head of the primary queue 
while the right portion is shifted down into the secondary 
queue and end of the primary queue. Thus, as blocks are 
evicted, the right portion is evicted first, followed by the 
first blocks of the left portion. Thus, with these queue 
sizes, SFIFO produces a distinctive fingerprint which 
can be used to uniquely identify this policy. 


As P increases, SFIFO behaves more like LRU. When 
P > 0.5 the fingerprint becomes identical to that of 
LRU, as shown in Figure 4. When the secondary queue
is that large, by the time a page is touched for the sec- 
ond time, it has already progressed into the secondary 
queue. Thus, the fingerprint reveals the LRU behavior 
of the policy and matches the LRU fingerprint. We feel
that since Segmented FIFO is used to approximate LRU
(especially with this high value of P), it is acceptable, 
and even appropriate, that its fingerprint cannot be dis- 
tinguished from that of LRU. 


3.3 Replacement Policies with Initial State


The Clock replacement algorithm is a popular ap- 
proach for managing unified file and virtual memory 
caches in modern operating systems, given its ability
to approximate LRU replacement with a simpler implementation.
The Clock algorithm is an interesting policy
to fingerprint because it has two pieces of internal initial
state: the initial position of the clock hand and whether
or not each use bit is set. Thus, we must ensure that 
Clock can be identified by its fingerprint regardless of 
its initial state. We now describe small modifications to 
our methodology to guarantee this behavior. 

In the basic implementation of Clock, the buffer cache 
is viewed as a circular buffer starting from the current 
position of the clock hand; a single use bit is associated 
with each page frame. Whenever a page is accessed, 
its use bit is set. When a replacement is needed, the 
clock hand cycles through page frames, looking for a 
frame with a cleared use bit and also clearing use bits as 
it inspects each frame. Thus, Clock approximates LRU 
by replacing pages that do not have their use bit set and 
have not been accessed for some time. 
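The basic Clock mechanism described above can be sketched in a few lines. The class and field names are hypothetical, and the sketch assumes that loading a page into a frame also sets that frame's use bit; real implementations may differ on this point.

```python
class Clock:
    """Minimal Clock replacement sketch with one use bit per frame."""

    def __init__(self, frames):
        self.pages = [None] * frames      # page held by each frame
        self.use = [False] * frames       # use bit of each frame
        self.where = {}                   # page -> frame index
        self.hand = 0

    def access(self, page):
        """Return True on a hit, False on a miss (the page is then loaded)."""
        if page in self.where:
            self.use[self.where[page]] = True
            return True
        # Sweep forward, clearing set use bits, until a clear bit is found.
        while self.use[self.hand]:
            self.use[self.hand] = False
            self.hand = (self.hand + 1) % len(self.pages)
        victim = self.pages[self.hand]
        if victim is not None:
            del self.where[victim]
        self.pages[self.hand] = page
        self.where[page] = self.hand
        self.use[self.hand] = True        # assumption: a load counts as a use
        self.hand = (self.hand + 1) % len(self.pages)
        return False
```

The sweep always terminates, since each frame it passes has its use bit cleared and will be eligible on the next pass.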

Since Clock treats the buffer cache as circular, the ini- 
tial position of the clock hand does not affect our current 
fingerprint. The initial position of the clock hand sim- 
ply determines where the first block of the test region is 
placed. Since all subsequent actions are relative to this 
initial position, this position is transparent to Dust. Thus,
we do not need to modify our fingerprinting methodol- 
ogy to account for hand position. 

However, the state of the use bits does impact our fingerprint.
Depending upon the fraction of set use bits, U,
the Clock fingerprint can look like FIFO or LRU. Specifically,
when U is near the two extremes of 0 or 1, the
fingerprint looks like FIFO; when U is near 0.5, the fingerprint
looks like LRU. We now describe the intuition
behind this behavior.

Figure 5: Fingerprints of the Clock Replacement Policy. To
identify Clock, the basic fingerprinting algorithm is run twice.
The first time it is run after the use bits have been all set; in
this case, Clock behaves identically to FIFO as shown in the
graph on the left. The second time it is run after half of the use
bits have been set; in this case, Clock has the same fingerprint
as LRU, as shown in the graph on the right.


In the simplest case, when U = 0, each frame starting 
with the clock hand is allocated to sequential pages of 
the test region. As a result, the clock hand wraps back 
to the beginning of the buffer cache after this allocation 
and as Dust touches each page to set attributes, the use 
bit of every page is set. During eviction, the first pages 
of the test region are replaced, matching both the behav- 
ior and fingerprint of a FIFO policy. Note that U = 1 
results in identical behavior, except the clock hand must 
first sweep through all frames clearing use bits before it 
allocates the test region sequentially. 

When U = 0.5, the left and right portions of the test 
region data are randomly interleaved in memory. This 
interleaving occurs because pages are allocated in two 
passes. In the first pass, those frames with cleared use 
bits are allocated to the left-hand portion of the test re- 
gion; the use bits of these frames are then set and the 
use bits of the remaining frames are cleared. In the sec- 
ond pass, the remaining frames are allocated to the right- 
hand portion of the test region. In the accesses to set the 
locality and frequency attributes of the pages, the use 
bits of all frames are again set. Thus, when the evic- 
tion phase begins, the first half of pages from both the 
left and right portions of the test region are replaced. If
the frames with set use bits are uniformly distributed,
this coincidentally matches the evictions of the LRU pol- 
icy. If the distribution of use bits were not uniform, the 
fingerprint would show those blocks whose frames had 
their use bits initially clear as having been replaced. We 
consider the case where they are uniformly distributed as 
this provides a consistent and recognizable fingerprint. 
Thus, to identify Clock, Dust brings the initial state 
of the use bits into each of these two configurations and 
observes the resulting two fingerprints. The following 
steps can be followed to configure the use bits from outside
of the OS. Dust sets all of the use bits (i.e., U = 1)
by allocating a warmup region of pages that fills the en- 
tire buffer cache and then touching all pages again (with 
no intervening allocations) so that their use bits are set. 
Setting half of the use bits (i.e., U = 0.5) is slightly 
more complex. The first step is to set all the use bits as in 
the previous scenario. In the second step, Dust allocates 
a few more pages to the warmup region; since all of the 
reference bits are set at this point, the clock hand must 
pass through the entire buffer cache, clearing all of the 
reference bits, to find a page to evict. The final step is to 
randomly touch half of the pages, setting their use bits. 
In this way, Dust can configure the state of the use bits. 
In summary, we modify Dust slightly to account for 
internal state. Before running any fingerprint, Dust first 
allocates the warmup region, which has the effect of set- 
ting use bits if the replacement policy implements them. 
If the resulting fingerprint looks like FIFO, then Dust 
runs again with half the use bits set. If the fingerprint 
still looks like FIFO, then we conclude that there are no 
use bits and the underlying policy is FIFO. If the second 
fingerprint looks like LRU, we conclude that Clock is the 
underlying policy. The result of running these two steps 
on the Clock replacement policy is shown in Figure 5. 
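The two-step decision procedure summarized above can be expressed as a small function. This is only a sketch: the function name and the string labels for fingerprints are hypothetical, and it assumes some earlier step has already matched each measured fingerprint to a named pattern.

```python
def classify_fifo_or_clock(first_fp, second_fp=None):
    """Apply the two-run rule for separating FIFO from Clock.

    first_fp:  pattern seen after the warmup region set all use bits.
    second_fp: pattern seen after half of the use bits were set
               (None if that second run has not been performed yet).
    """
    if first_fp != "FIFO":
        return first_fp                       # no ambiguity to resolve
    if second_fp is None:
        return "rerun with half the use bits set"
    if second_fp == "FIFO":
        return "FIFO"                         # no use bits in the policy
    if second_fp == "LRU":
        return "Clock"
    return "unknown"


print(classify_fifo_or_clock("FIFO", "LRU"))
```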


3.4 Replacement Policies with History 


We now show that Dust is able to distinguish those re- 
placement policies that use long-term history from those 
that do not. We begin by briefly showing that the poli- 
cies examined above (FIFO, LRU, LFU, Random, Seg- 
mented FIFO, and Clock) do not use history. We then 
discuss in more detail the behavior of those policies 
(LRU-K and 2Q) that do use history. 

Figure 6 shows the long-term fingerprints of three representative
policies that do not use history. The graph
on the left is that for LRU; FIFO, LFU, and Segmented 
FIFO look identical and are not shown. The graph shows 
the results of probing the hot and cold regions of the 
test data. As expected, the hot data has been entirely 
evicted, as shown by its high probe times; although the 
initial portion of the cold data is also evicted due to the 
size of the eviction region, the cold data is clearly preferred
by these policies. The middle graph shows that
Random has no preference for either hot or cold data.
Finally, the graph on the right shows that the historical
behavior of Clock is difficult to determine when the use
bits are not explicitly controlled. In this graph, the use
bits are set to U = 0.5; as a result, the hot and cold
regions are interleaved in the file buffer and then each
region is replaced sequentially. To illustrate that Clock
does not use history, Dust must again ensure that the use
bits are all first cleared (or set); with this initialization
step, the history fingerprint of Clock is identical to the
first graph in the figure. Thus, FIFO, LRU, LFU, Segmented
FIFO, Random, and Clock do not use history in
making replacements.

Figure 6: History Fingerprint of Short-term Policies. Probes are performed on only pages in the hot (i.e., the blocks on the
left) and cold (i.e., the blocks on the right) test regions. The graph on the left shows the fingerprint for FIFO, LRU, LFU, and
Segmented FIFO. Since the cold test region remains in the buffer cache, these policies do not prefer pages with history. The graph
in the middle shows that Random also has no preference for pages with history and thus does not use history. Finally, the graph on
the right shows that the historical fingerprint of Clock is ambiguous if the use bits are not set; after the use bits have been properly
set, the fingerprint is identical to the leftmost graph.

Figure 7: Fingerprints of LRU-2. The first graph shows the short-term fingerprint of LRU-2 when the correlated reference count
is set to zero; in this case, LRU-2 displaces those pages with a frequency count less than 2 and those whose second-to-last reference
is the oldest. The second graph shows the short-term fingerprint of LRU-2 when the correlated reference count is increased; here,
no pages in the eviction with a frequency count higher than two are evicted. Finally, the last graph shows the history fingerprint
of LRU-2, verifying that it prefers the hot pages.


The LRU-K replacement policy was introduced by the 
database community to address the problem that LRU is 
not able to discriminate between frequently and infre- 
quently accessed pages [19]. The idea behind LRU-K 
is that it tracks the K-th reference to each page in the
past, and replaces the page with the oldest K-th reference
(or a page that does not have a K-th reference);
thus, traditional LRU is equivalent to LRU-1. Given that
K = 2 exhibits most of the benefits of the general case,
and is the most commonly used value, we only consider
LRU-2 further. LRU-2 is sensitive to another parameter
as well, the correlated reference period, C; the intuition
is that accesses to a page within this period should
not be counted as distinct references. Since setting C
correctly is a non-trivial task, the default value for C is
zero. Given that LRU-2 is complex, we note that our
implementation is derived from the version provided by
the original authors [20].
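The victim-selection rule of LRU-K can be sketched directly from this description. The sketch is not the original authors' implementation: it assumes a correlated reference period of zero, a hypothetical `history` mapping of resident pages to their reference times, and a tie-break on the most recent reference among pages that lack a K-th reference.

```python
def lru_k_victim(history, k=2):
    """Pick the page to replace under LRU-K (correlated period assumed 0).

    history: dict mapping each resident page to its list of reference
             times, oldest first; every page has at least one reference.
    """
    def key(page):
        refs = history[page]
        # Pages without a K-th reference sort first (treated as oldest).
        kth = refs[-k] if len(refs) >= k else float("-inf")
        return (kth, refs[-1])
    return min(history, key=key)


pages = {"a": [1, 10], "b": [5, 6], "c": [9]}
print(lru_k_victim(pages))
```

In this example page "c" is chosen first because it has only one reference; among the remaining pages, "a" is chosen before "b" even though "a" was touched most recently, since its second-to-last reference is older.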

We begin by briefly exploring the sensitivity of LRU- 
2 to the correlated reference period; the short-term fin- 
gerprints of LRU-2 are shown in the first two graphs of 
Figure 7. When C = 0 (i.e., the default value), the resulting
fingerprint is a variation of pure LRU, as shown
in the left-most graph. Specifically, the last stripe of the
test region is evicted with LRU-2; since this stripe was
accessed only twice, its second-to-last reference is very
old (i.e., when the page was initially referenced). As the
correlated reference period is increased such that C > 0,
the fingerprint looks more similar to LFU, as shown in
the middle graph. With this setting, pages in the eviction
region are classified as having only correlated references
and thus replace mostly themselves; thus, all of
those pages that have a frequency count greater than two
are kept in memory. Finally, when C is very large, all
accesses are treated as correlated and thus no pages have
a second-to-last reference; in this case the behavior degenerates
to pure LRU (not shown). In summary, LRU-2
produces a distinctive fingerprint that uniquely identifies
it and also indicates the approximate setting of the correlated
reference period.

Figure 8: Fingerprints of 2Q. The first fingerprint of 2Q shows that the short-term replacement policy used is FIFO. The second
fingerprint shows that 2Q uses history, preferring pages that have been accessed and then evicted. The third fingerprint shows that
the replacement policy used for pages in the main queue is LRU.

Next, we verify that LRU-2 uses history. The last 
graph in Figure 7 shows the historical fingerprint of 
LRU-2. As desired, the hot region is given preference
over data in the cold region; this occurs because the 
second-to-last reference of pages in the hot region is 
more recent than the second-to-last reference to those 
in the cold region. Further, when a replacement must be 
made within the hot region, those with the oldest second- 
to-last reference are chosen. 

The 2Q algorithm was proposed as a simplification to 
LRU-2 with less run-time overhead yet similar performance
[12]. The basic intuition behind 2Q is that instead
of removing cold pages from the main buffer, it only admits
hot pages to the main buffer. Thus, the buffer cache
is divided into two buffers: a temporary queue for short-term
accesses, A1in, which is managed with FIFO, and
the main buffer, Am, which is managed with LRU. Pages
are initially admitted into the A1in queue and only after
they have been evicted and reaccessed are they admitted
into Am. Thus, 2Q has another structure to remember
the pages that have been accessed but are no longer in
the buffer cache, A1out. In our experiments, we set
A1in to use 25% of the buffer cache (with Am using the
other 75%); A1out is able to remember a number of
past references equal to 50% of the number of pages in
the cache.
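The structure of 2Q described above can be sketched as follows. This is a simplified reading for illustration, not the simulator used in the paper: the class and method names are hypothetical, the 25% and 50% fractions are the settings quoted above, and the reclaim rule (evict from A1in while it exceeds its share, otherwise from Am) is an assumption about the details.

```python
from collections import OrderedDict


class TwoQ:
    """Simplified 2Q sketch: A1in (FIFO), A1out (ghost list), Am (LRU)."""

    def __init__(self, capacity, kin_frac=0.25, kout_frac=0.5):
        self.capacity = capacity
        self.kin = max(1, int(capacity * kin_frac))
        self.kout = max(1, int(capacity * kout_frac))
        self.a1in = OrderedDict()     # resident pages, oldest first
        self.a1out = OrderedDict()    # remembered page ids only, oldest first
        self.am = OrderedDict()       # resident pages, least recent first

    def access(self, page):
        """Return True on a hit, False on a miss."""
        if page in self.am:
            self.am.move_to_end(page)
            return True
        if page in self.a1in:
            return True               # FIFO queue: no reordering on a hit
        if page in self.a1out:        # seen before and evicted: promote
            del self.a1out[page]
            self._reclaim()
            self.am[page] = True
        else:                         # first sighting: admit to A1in only
            self._reclaim()
            self.a1in[page] = True
        return False

    def _reclaim(self):
        """Free one frame if the cache is full."""
        if len(self.a1in) + len(self.am) < self.capacity:
            return
        if len(self.a1in) > self.kin or not self.am:
            old, _ = self.a1in.popitem(last=False)
            self.a1out[old] = True    # remember the evicted page
            if len(self.a1out) > self.kout:
                self.a1out.popitem(last=False)
        else:
            self.am.popitem(last=False)
```

Only pages that miss while still remembered in A1out ever reach Am, which is what gives previously evicted pages their preference.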

We show the fingerprints for 2Q in Figure 8. The 
first graph shows that the short-term fingerprint of 2Q 
is identical to FIFO. Given that the A1in queue is managed
with FIFO and the short-term fingerprint does not
access pages after they have been evicted, this is the expected
result. However, 2Q can be easily distinguished
from pure FIFO by observing the history fingerprint
shown in the second graph. In the historical fingerprint,
we can see that the hot region remains entirely in the 
buffer cache, since these are the only accesses that are 
moved to the Am buffer. Finally, we are able to iden- 
tify the replacement policy employed by the long-term 
buffer, Am, by setting the initial access, recency, and fre- 
quency attributes of the hot region and then forcing evic- 
tions from it. Since this methodology is more specific to 
the 2Q replacement policy, we do not describe it in more 
detail. This fingerprint is shown as the last graph of Fig- 
ure 8 and correctly identifies the LRU policy of the Am 
buffer. We note that for LRU-2 or other policies that use 
history, a similar technique could be used to determine 
the replacement strategy of the long-term queue. How- 
ever, explicitly setting the state of the long-term queue 
requires knowledge of the policy of the short-term queue 
and the policy for moving a block from one queue to the 
other. Hence a fingerprinting technique for the long-term 
queue is by nature specific to the policy of the short-term 
queue. 


3.5 Sensitivity to Buffer Size Estimate 


In our last set of experiments we verify the robustness 
of Dust to inaccuracies in its estimate of the size of the 
buffer cache. If the estimate of the buffer cache size is 
significantly different than its actual value, then the re- 
sulting fingerprints are not identifiable. If the estimate 
of the cache is much too small, then Dust does not touch 
enough pages to force evictions to occur; if the estimate 
is much too large, then Dust evicts the entire region. 

The short-term fingerprint is more sensitive to this es- 
timate than the historical fingerprint: in the short-term 
fingerprint we must observe the presence or absence of 
stripes that use only 1/10th of the buffer cache, whereas 
in the historical fingerprint we must observe a hot or 
cold region that uses half of the buffer cache. However, 
as Figure 9 shows, the short-term fingerprint of LRU is 
distinguishable even with estimates that are either 20% 
under or over the real size. The other replacement policies,
with the exception of Clock, are robust to a similar
degree.

Figure 9: Sensitivity of LRU Fingerprint to Cache Size Estimate. These graphs show the short-term fingerprints of LRU as
the estimate of the size of the buffer cache is varied. In the first graph the estimate is too high by 20%, in the second graph the
estimate is perfect, and in the third graph the estimate is too low by 20%. However, all fingerprints still uniquely identify LRU.

Figure 10: Sensitivity of Clock Fingerprint to Cache Size Estimate. These graphs show the short-term fingerprints of Clock
with half of the use bits set as the estimate of the size of the buffer cache is varied. With U = 0.5, Clock is expected to look like
LRU. In the first graph the estimate is too high by 10%, in the second graph the estimate is perfect, and in the third graph the
estimate is too low by 10%. Thus, the Clock fingerprint is not as robust to inaccuracies in this estimate as the other algorithms.

The Clock replacement algorithm is more sensitive to 
this estimate due to our need to configure the state of 
the use bits. Specifically, the size of the warm-up region 
used by Dust to fill the buffer cache must be accurate as 
well. Figure 10 shows that Dust is still reasonably tol- 
erant to errors in cache-size estimate when identifying 
Clock but not as robust as when identifying other algo- 
rithms. 


4 Platform Fingerprints 


Buffer caching in modern operating systems is often 
much more complex than the simple replacement poli- 
cies described in operating systems textbooks. Part of 
this complexity is due to the fact that the filesystem 
buffer cache is integrated with the virtual memory sys- 
tem in many current systems; thus the amount of mem- 
ory dedicated to the buffer cache can change dynami- 
cally based on the current workload. To control this ef- 
fect, Dust minimizes the amount of virtual memory that 
it uses, and thus tries to maximize the amount of memory 
devoted to the file buffer cache. Further, we run Dust on 
an otherwise idle system to minimize disturbances from 
competing processes.

In this section, we describe our experience fingerprinting
three Unix-based operating systems:
NetBSD 1.5, Linux 2.2.19 and 2.4.14, and Solaris 2.7.
As we will see, the fingerprints of real systems contain 
much more variation than those of our simulations. In 
addition to fingerprinting the replacement policy of the 
buffer cache, Dust also reveals the cost of a hit versus 
a miss in the buffer cache, the size of the buffer cache, 
and whether or not the buffer cache is integrated with 
the virtual memory system. 

Dust takes a considerable amount of time to run on 
a real system. Generating a sufficient number of data 
points requires running many iterations of test scan, 
eviction scan, and probes. In our experiments we always 
allowed at least 300 iterations. We found that one itera- 
tion can take anywhere from 30 seconds to three minutes 
depending on the system under test. Note that systems 
with smaller buffer caches can be tested in a shorter pe- 
riod of time since the test region becomes smaller. We 
feel this relatively long running time is acceptable since,
for any given system configuration, Dust need only be 
run once; the results can be stored and made available to 
applications and programmers. 
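To put the running time in concrete terms, the figures quoted above (at least 300 iterations, each taking between 30 seconds and three minutes) can be combined in a short calculation. The function name is hypothetical and the result is only a bound implied by those two figures, not a measured total.

```python
def total_runtime_hours(iterations=300, min_s=30, max_s=180):
    """Return the (low, high) total running time in hours."""
    return iterations * min_s / 3600, iterations * max_s / 3600


low, high = total_runtime_hours()
print(f"{low:.1f} to {high:.1f} hours")
```

With these inputs a full run falls between 2.5 and 15 hours.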

All of the experiments described in this section were
run on systems with dual Pentium III-Xeon processors, 
1 GB of physical RAM and a SCSI storage subsystem 
with Ultra2, 10000 RPM disks.

Figure 11: Fingerprints of NetBSD 1.5. The first graph
shows the short-term fingerprint of NetBSD, indicating the
LRU replacement policy. The second graph shows the long-
term fingerprint, indicating that history is not used.


4.1 NetBSD 1.5 


Given that NetBSD 1.5 [16] has the most straight- 
forward replacement policy of the systems we have ex- 
amined, we begin with its fingerprint, shown in Fig- 
ure 11. As in the simulations, we examine both short- 
term and long-term fingerprints. The first graph in Figure 11
shows the expected pattern for pure LRU replacement;
given that Dust produces this same fingerprint regardless
of whether it attempts to manipulate use bits,
we can infer that NetBSD implements strict LRU, and 
not Clock. This conclusion is further verified by the sec- 
ond graph of Figure 11 showing that NetBSD does not 
use history. Documentation [16] and inspection of the 
source code [17] confirm our finding. 


From the fingerprints we can also infer other param- 
eters. Specifically, we can see that the time for reading 
a byte from a page in the buffer cache is on the order 
of 10 µs, whereas the time for going to disk varies between
about 1 ms and 10 ms. Further, even on this machine
with 1 GB of physical memory, NetBSD devotes
only about 50 MB to the buffer cache (most easily shown 
by the fact that the history fingerprint devotes this much 
memory to the hot and cold regions); this allows us to in- 
fer that the file buffer cache is segregated from the VM 
system.

Figure 12: Fingerprints of Linux 2.2.19. The first graph
shows the short-term fingerprint of Linux 2.2.19 when the use
bits are all set; the second graph shows the fingerprint when
the use bits are untouched.


4.2 Linux 2.2.19 


Linux 2.2.19 is a very popular version of the Linux
kernel in production environments. In Section 5 we will
run the NeST web server on top of this OS; thus, it is
important for us to understand this fingerprint.


The short-term fingerprint of Linux 2.2.19 is shown
in Figure 12. The graph on the left shows the results
when Dust attempts to set all of the use bits. Since this
graph looks like FIFO, we must investigate further to
determine if Clock is actually being used. The graph
on the right shows the fingerprint when the use bits are
left in a random state. Although this fingerprint is very
noisy, one can see that priority is given to pages that are
most recently referenced (i.e., pages near the second and
fourth quarters); further, after filtering the data, we are
able to verify that more pages in the first and third
quarters are out of cache than in cache. Thus, this
fingerprint is similar to the LRU fingerprint expected for
a Clock-based replacement algorithm. Examination of the
source code and documentation confirms that the
replacement policy is Clock based [15, 34]. Finally, since
the buffer cache size is very close to the amount of
physical RAM in the system, we conclude that the buffer
cache is integrated with the VM system.


4.3 Linux 2.4.14


The memory management system within Linux underwent
a large revision between versions 2.2 and 2.4; thus, we
see a very different fingerprint for Linux 2.4.14, which
uses a more complex replacement scheme than either
Linux 2.2.19 or NetBSD. The short-term fingerprint, shown
as the first graph in Figure 13, suggests that Linux 2.4
uses both a recency and a frequency component, and does
not use Clock. Further, the second graph of Dust shows
that Linux 2.4 does use history in its decision.

Examination of the Linux 2.4.14 source code and existing
documentation confirms these results [15, 34].
Linux maintains two separate queues: an active and
an inactive list. When memory becomes scarce, Linux
shrinks the size of the buffer cache. In doing this, pages
that have not been recently referenced (as indicated by
their reference bit) are moved from an active list to an
inactive list. The inactive list is scanned for replacement
victims using a form of page aging, in which an
age counter is kept for each frame, indicating how
desirable it is to keep this page in memory. When scanning
for a page to evict, the page age is decreased as it is
considered for eviction; when the page age reaches zero, the
page is a candidate for eviction. The age is incremented
whenever the page is referenced.
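
The aging rule described above can be illustrated with a small sketch. It is not the Linux 2.4 implementation: the names `Page`, `touch` and `pick_victim`, the increment `AGE_STEP` and the cap `AGE_MAX` are assumptions chosen for illustration, and the active/inactive list split is omitted.

```python
from dataclasses import dataclass

AGE_STEP = 3   # assumed increment applied on each reference
AGE_MAX = 20   # assumed upper bound on a page's age


@dataclass
class Page:
    name: str
    age: int = 0


def touch(page):
    """Record a reference: the age grows, up to an assumed cap."""
    page.age = min(page.age + AGE_STEP, AGE_MAX)


def pick_victim(pages):
    """Scan the list repeatedly, decreasing each page's age as it is
    considered; the first page whose age reaches zero is the victim."""
    if not pages:
        raise ValueError("no pages to evict")
    while True:
        for page in pages:
            page.age = max(page.age - 1, 0)
            if page.age == 0:
                return page
```

In this sketch a page that was referenced twice survives more scans than a page referenced once, which is the frequency component that the fingerprint suggests.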


4.4 Solaris 2.7 


Solaris presented us with the greatest challenge of the 
platforms we studied. The VM subsystem of Solaris 
has not been thoroughly studied; it is believed to use 
a two-handed, global Clock algorithm [7], but some re- 
searchers have noted non-intuitive behavior [3]. In two- 
handed Clock, one hand clears reference bits while the 
second hand follows some fixed distance behind, select- 
ing a page for replacement if its reference bit is still clear. 
The hands are advanced in unison such that once the ref- 
erence bit on a page is cleared, it has some opportunity 
to be re-referenced before it is a candidate for eviction. 
When implemented in our simulator, the fingerprint of 
two-handed Clock looks identical to FIFO (not shown). 
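
The two-handed scheme can be sketched as follows. This is a minimal model written for illustration, not Solaris code and not the simulator used here; the class name `TwoHandedClock` and the parameter `gap` (the fixed distance between the hands, in frames) are assumptions.

```python
class TwoHandedClock:
    """Minimal model: the front hand clears reference bits, the back
    hand trails `gap` frames behind and evicts frames still clear."""

    def __init__(self, nframes, gap):
        self.n = nframes
        self.ref = [False] * nframes
        self.back = 0
        self.front = gap % nframes

    def reference(self, frame):
        # Called when the page held in `frame` is accessed.
        self.ref[frame] = True

    def evict(self):
        """Advance both hands in unison until the back hand finds a
        frame whose reference bit is clear; return that frame index."""
        while True:
            self.ref[self.front] = False
            victim = None if self.ref[self.back] else self.back
            self.front = (self.front + 1) % self.n
            self.back = (self.back + 1) % self.n
            if victim is not None:
                return victim
```

A frame that is referenced again after the front hand has passed it, but before the back hand arrives, is skipped; this is the re-reference window that the gap between the hands provides.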

The short-term fingerprint of Solaris 2.7 is shown in
the first graph of Figure 14. The out-of-cache areas on
both the far right and left of the fingerprint strongly
suggest that Solaris is using a frequency (or aging)
component in its eviction decision in addition to Clock. The
second graph of Figure 14 shows the historical fingerprint
for Solaris. Though the data is again noisy, it shows
a clear preference for the hot region, again suggesting 
that history or page aging is also used in Solaris. The 
fingerprint also shows that the time to service a buffer 
cache hit is significantly higher in Solaris than in Linux. 
The fingerprint shows a hit time of over 10 21s, whereas 

Figure 13: Fingerprints of Linux 2.4.14. The first graph
shows the short-term fingerprint of Linux 2.4.14, indicating
that a combination of LRU and LFU is used. The second graph
shows the long-term fingerprint, indicating that history is used.


the hit time for Linux 2.4 on the same platform is under
10 µs.


5 Cache-Aware Web Server 


In this section, we describe how knowledge of the
buffer cache replacement algorithm can be exploited to
improve the performance of a real application. We do
so by modifying a web server to re-order its accesses to
first serve requests that are likely to hit in the file system
cache, and only then serve those that are likely to miss.
This idea of handling requests in a non-FIFO service
order is similar to that introduced in connection scheduling
web servers [9]; however, whereas that work scheduled
requests based upon the size of the request, we schedule
based upon predicted cache content. As we will see,
re-ordering based on cache content both lowers average
response time (by emulating a shortest-job-first scheduling
discipline) and improves throughput (by reducing total
disk traffic).


5.1 Approach 


The key challenge in implementing the cache-aware 
server is to use our gray-box knowledge of the file 
caching algorithm to determine which files are in the 
cache. By keeping track of the file access stream be- 
ing presented to the kernel, the web server can simulate 

Figure 14: Fingerprint of Solaris 2.7. The first graph shows
the short-term fingerprint of Solaris; the second graph shows
the history fingerprint.


the operating system’s buffer cache and thus predict at 
any given time what data is in cache. We term this al- 
gorithmic mirroring, and believe that it is a general and 
powerful manner in which to exploit gray-box knowl- 
edge. 

One important assumption of algorithmic mirroring 
is that the application induces most or all of the traffic 
to the file system, and thus the mirror cache is likely 
to accurately represent the state of the real OS cache. 
Although this assumption may not hold in the general 
case within a multi-application environment, we believe 
it is feasible when a single application dominates all file- 
system activity. Server applications such as a web server 
or database management system are thus a perfect match 
for such mirroring methods. 

The NeST storage appliance [6] supports HTTP as
one of its many access protocols. NeST allows a
configurable number of requests to be serviced
simultaneously. Any requests received beyond that number are
queued until one of the pending requests completes. By
default, NeST services queued requests in FIFO order.
We term this default behavior cache-oblivious NeST.

We have modified the NeST request scheduler to keep
a model of the current state of the OS buffer cache. The
model is updated each time a request is scheduled. NeST
bases its model of the underlying file cache on the
algorithm exposed by Dust. NeST uses this model to reorder
requests such that those requests for files believed to be
in cache are serviced first. Note that NeST does not
perform caching of files itself, but relies strictly upon the
OS buffer cache.
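
The following sketch illustrates the idea of a mirror cache driving the request order. It is not the NeST scheduler: for brevity it mirrors a whole-file LRU policy rather than the Clock policy used in our experiments, and the names `MirrorCache` and `next_request` are hypothetical.

```python
from collections import OrderedDict


class MirrorCache:
    """Whole-file LRU model of the OS buffer cache (sizes in bytes)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.files = OrderedDict()  # path -> size, least recent first
        self.used = 0

    def contains(self, path):
        return path in self.files

    def access(self, path, size):
        # Move the file to the most-recent position, then evict the
        # least recently used files until the model fits the capacity.
        if path in self.files:
            self.used -= self.files.pop(path)
        self.files[path] = size
        self.used += size
        while self.used > self.capacity and self.files:
            _, evicted_size = self.files.popitem(last=False)
            self.used -= evicted_size


def next_request(queue, mirror, sizes):
    """Remove and return the first queued path predicted to hit in
    the cache; if none is predicted to hit, fall back to FIFO order."""
    if not queue:
        return None
    index = 0
    for i, path in enumerate(queue):
        if mirror.contains(path):
            index = i
            break
    path = queue.pop(index)
    mirror.access(path, sizes[path])  # update the model on scheduling
    return path
```

The model is updated when a request is scheduled, not when it completes, which matches the description above; a server that issues many requests concurrently may want to revisit that choice.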

For the cache mirror to accurately reflect the internal
state of the OS, NeST must have a reasonable estimate
of the cache size. In our current approach, NeST uses
the static estimate produced by Dust; the disadvantage
of this approach is that this estimate is produced without
contention with the virtual memory system, and thus
may be larger than the amount available when the web
server is actually running. To increase the robustness of
our estimate, we plan to modify NeST to dynamically
estimate the size of the buffer cache by measuring the
time for each file access. If the time is “low”, the file
must have been in the cache, and if it is “high”, the file
was likely on disk. By comparing these timings with
the prediction provided by the mirror cache, NeST can
adjust the size of the mirror cache.
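
One possible form of this adjustment is sketched below. The rule is an assumption made for illustration only, since the planned mechanism is not specified beyond the comparison of timings with predictions; the function name `adjust_estimate` and the parameters `threshold` and `step` are hypothetical.

```python
def adjust_estimate(estimate, predicted_hit, elapsed, threshold, step):
    """Return a new cache-size estimate after one timed file access.

    estimate      -- current mirror cache size (any unit, e.g. MB)
    predicted_hit -- whether the mirror cache predicted a hit
    elapsed       -- measured access time in seconds
    threshold     -- assumed boundary between "low" and "high" times
    step          -- assumed adjustment granularity, same unit as estimate
    """
    observed_hit = elapsed < threshold
    if predicted_hit and not observed_hit:
        # The mirror was too optimistic: the real cache is smaller.
        return max(estimate - step, step)
    if observed_hit and not predicted_hit:
        # The mirror was too pessimistic: the real cache is larger.
        return estimate + step
    return estimate
```

A single mispredicted access may be noise, so a practical version would likely smooth over many observations before changing the estimate.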


5.2 Performance 


To evaluate the performance benefits of cache-aware
scheduling, we compare the performance of cache-aware
NeST to cache-oblivious NeST for two different
workloads. In all tests, the web server is run on a dual
Pentium III-Xeon machine with 128 MB of main memory
and Ultra II disks. For clients, we use four machines
(identical to the server, except containing 1 GB of main
memory) each running 36 client threads. The clients are
connected to the server with Gigabit Ethernet.

The server and clients are running Linux 2.2.19,
which was shown in Section 4.2 to use the Clock
replacement algorithm; therefore, cache-aware NeST is
configured to model the Clock algorithm as well. In
our configuration, the server has approximately 80 MB
of memory dedicated to the buffer cache. In our
experiments, we explore the performance of cache-aware
NeST as we vary its estimate of the size of the buffer
cache.

In our first experiment, we consider a workload in
which each client thread repeatedly requests a random
file from a set of 200 1 MB files. Figures 15 and 16
show the average response time and throughput,
respectively, for three different web servers: the Apache
web server [1], cache-oblivious NeST, and cache-aware
NeST as a function of its estimate of cache size. We
begin by comparing the response time and the throughput
of NeST and Apache; from the two figures, we see
that although NeST incurs some overhead for its flexible
structure (e.g., NeST can handle multiple transfer
protocols, such as FTP and NFS), it achieves respectable
performance as a web server and is a reasonable platform
for studying cache-aware scheduling. Second, and
most importantly, adding cache-aware scheduling
significantly improves both the response time and the
throughput of NeST. By first servicing requests that hit in
the cache, cache-aware scheduling improves average
response time by servicing short requests first. More
dramatically, cache-aware scheduling improves throughput
by reducing the number of disk reads (verified through
the /proc interface): in-cache requests are handled
before their data is evicted from the cache. Finally, the
performance of cache-aware NeST improves when its
estimate of the cache size is closer to the real value, but is
robust to a large range of cache size estimates.

Figure 15: Response Time as a Function of Cache Size
Estimate. Response time in cache-aware NeST is lowest when
the estimate of cache size is closest to the true size of the cache.

In our second experiment, we consider a workload
created by the SURGE HTTP workload generator [5].
The SURGE workload uses approximately 12,000 distinct
files with sizes taken from a Zipf distribution with
a mean of approximately 21 KB. SURGE is thus a more
representative web workload than the one presented above.

With the SURGE workload, we measure qualitatively
similar results to those above, with two main
differences. First, the performance of cache-oblivious
NeST relative to Apache degrades slightly more; for
example, the average response time for cache-oblivious
NeST is 0.80 seconds and for Apache is 0.65 seconds.
This result is expected, given that NeST is designed
for staging data in the Grid, and is thus optimized for
large files and not the small files more typical in web
workloads. Second, the performance of cache-aware
NeST is not as sensitive to its estimate of the cache size;
for example, performance improves from 4.27 MB/s to
4.69 MB/s (approximately 10%) as the cache size
estimate is improved from 10 MB to 80 MB. Apache
achieves 4.91 MB/s. In the future, we plan to experiment
with other web servers and workloads.


6 Related Work 


The idea of using algorithmic knowledge of the under- 
lying operating system to improve performance has been 

Figure 16: Sensitivity to Cache Estimate Accuracy. The
performance of cache-aware NeST improves as the estimate of
cache size approaches the true size of the buffer cache. The
buffer cache is approximately 80 MB. Cache-oblivious NeST
and Apache are shown for comparison.


recently explored in the context of gray-box systems [3].
This work showed that an “OS-like” service can be
implemented as an Information and Control Layer (ICL)
outside of the OS, given algorithmic knowledge of the
OS, probes of the OS, and statistical analysis. However,
no concrete solutions were proposed for how developers
of ICLs can obtain this algorithmic knowledge. In this
paper, we show that fingerprinting can obtain this
gray-box knowledge in a simple and automatic manner.


Fingerprinting system components to determine their
behavior is not new and has been used successfully
in other contexts, notably in networking and storage.
Specifically, fingerprinting has been used to uncover key
parameters within the TCP protocol and to identify the
likely OS of a remote host [11, 21]. The primary
difference between fingerprinting within TCP and in our
context is that we are trying to identify policies that can have
arbitrary behavior, rather than implementations that are
expected to adhere to given specifications. In [25, 35]
techniques similar to those used in Dust were used to
determine various characteristics of disks, such as size of
the prefetch window, prefetching algorithm and caching
policy.

Fingerprinting also shares much in common with
microbenchmarking. Specifically, both perform requests
of the underlying system in order to characterize its
behavior. For example, with simple probes in
microbenchmarks, one can determine parameters of the memory
hierarchy [2, 24], processor cycle time [28], and
characteristics of disk geometry [26, 30]. In our view, the key
difference between fingerprinting and microbenchmarking
is that a fingerprint is used to discover the policy
or algorithm employed by the underlying layer, whereas
a microbenchmark is typically used to uncover specific
system parameters.

The idea of discovering characteristics of lower layers
of a system and using that knowledge in higher layers to
improve performance is not new. In traxtents [26] the
file system layer of the operating system was modified
to avoid crossing disk track boundaries so as to
minimize the cost incurred due to head switching and exploit
“zero-latency” access. Yu et al. developed a method of
predicting the position of the disk head without hardware
support and used that information to determine which of
several rotational replicas to use to service a given
request [36], thus giving software expanded knowledge of
hardware state.

Our approach involves informing the application of
the buffer cache replacement policy in use by the
operating system. SLEDs [33] and dynamic sets [29] seek to
increase the knowledge that the application and operating
system have of each other. Both take the approach
of embellishing the interface between the OS and the
application to allow the explicit exchange of certain types
of information. In the case of dynamic sets, the application
has the ability to provide more knowledge about its
future access patterns. This allows the OS to reorder the
fetching of data to improve cache performance. SLEDs
allow the OS to export performance data to the
application, enabling the application to modify its workload
based on the performance characteristics of the
underlying system.

The idea of servicing requests within a web server in a
particular order was explored in connection-scheduling
web servers [9]. The main thesis of that research is that
better performance can be obtained by controlling the
scheduling of requests within the web server, rather than
with the OS. While their approach used static file size
to schedule requests, cache-aware NeST uses a dynamic
estimate of the contents of the buffer cache. In future
work, we hope to investigate the interactions of scheduling
requests based on both file size and cache content.

Our cache-aware web server has similarities to 
locality-aware request distribution (LARD) cluster- 
based web servers [22]. In LARD, the front-end node 
directs page requests to a specific back-end node based 
upon which back-end has most recently served this page 
(modulo load-balancing constraints); thus, the front-end 
has a simple model of the cache contents of each back- 
end and tries to improve their cache hit rates. Our ap- 
proaches are complementary, as LARD partitions re- 
quests across different nodes, whereas we use cache con- 
tent to service requests in a different order on a single 
node. 


7 Conclusions and Future Work 


We have shown that various buffer cache replacement
algorithms can be uniquely identified with a simple
fingerprint. Our fingerprinting tool, Dust, classifies
algorithms based upon whether they consider initial
access, locality, frequency, and/or history when choosing
a block to replace. With a simple simulator, we have
shown that FIFO, LRU, LFU, Clock, Random, Segmented
FIFO, 2Q, and LRU-K all produce distinctive
fingerprints, allowing them to be uniquely identified. We
have also begun to address the more challenging problem
of fingerprinting real systems. By running Dust on
NetBSD, Linux, and Solaris, we have shown that we can
determine which attributes are considered by each page
replacement algorithm. Finally, we have shown that the
algorithmic knowledge revealed by Dust is useful for
predicting the contents of the file cache. Specifically,
we have implemented a cache-aware web server that
services first those requests that are predicted to hit in the
file cache, improving both response time and bandwidth.

In the near future, we would like to extend the range
of policies that Dust is able to recognize. Specifically,
we would like to see how adaptive policies such
as EELRU [27] and LRFU [13] can be identified, as well
as policies that use other attributes such as the size of a
page or the cost of replacing a page. In our current
system, one must visually interpret the fingerprint graphs
produced by Dust; we would like to automate this
process for the well-known replacement policies.

In the long-term, we plan to continue exploring fin- 
gerprinting of other subsystems within the OS (e.g., the 
CPU scheduler). We would also like to determine how 
algorithmic knowledge can be used across several user 
processes; the main challenge is performing a model or 
simulation in which access to all OS inputs is not re- 
quired for accuracy. Finally, we are investigating how 
algorithmic knowledge can be used not only to infer the 
contents of the file cache, but to change its contents as 
well. 


Abstract


This paper describes the architecture and performance
of the JX operating system. JX is both an operating system
completely written in Java and a runtime system for Java
applications.

Our work demonstrates that it is possible to build a
complete operating system in Java, achieve good
performance, and still benefit from the modern software
technology of this object-oriented, type-safe language. We explain
how an operating system can be structured that is no longer
built on MMU protection but on type safety.

JX is based on a small microkernel which is responsible
for system initialization, CPU context switching, and
low-level protection-domain management. The Java code is
organized in components, which are loaded into domains,
verified, and translated to native code. Domains can be
completely isolated from each other.

The JX architecture allows a wide range of system con- 
figurations, from fast and monolithic to very flexible, but 
slower configurations. 

We compare the performance of JX with Linux by using
two non-trivial operating system components: a file system
and an NFS server. Furthermore, we discuss the
performance impact of several alternative system configurations.
In a monolithic configuration JX achieves between about
40% and 100% of Linux performance in the file system
benchmark and about 80% in the NFS benchmark.


1 Introduction 


The world of software production has dramatically
changed during the last decades from pure assembler
programming to procedural programming to object-oriented
programming. Each step raised the level of abstraction and
increased programmer productivity. Operating systems, on
the other hand, remained largely unaffected by this process.
Although there have been attempts to build object-oriented
or object-based operating systems (Spring [27], Choices
[10], Clouds [17]) and many operating systems internally
use object-oriented concepts, such as vnodes [31], there is a
growing divergence between application programming and
operating system programming. To close this semantic gap
between the applications and the OS interface a large
market of middleware systems has emerged over the last years.
While these systems hide the ancient nature of operating
systems, they introduce many layers of indirection with
several performance problems.

While previous object-oriented operating systems
demonstrated that it is possible and beneficial to use
object-orientation, they also made it apparent that it is a problem
when implementation technology (object orientation) and
protection mechanism (address spaces) mismatch. There
are usually fine-grained “language objects” and
large-grained “protected objects”. A well-known project that tried
to solve this mismatch by providing object-based protection
in hardware was the Intel iAPX/432 processor [37]. While
this project is usually cited as a failure of object-based
hardware protection, an analysis [14] showed that with slightly
more mature hardware and compiler technology the
iAPX/432 would have achieved good performance.

We believe that an operating system based on a
dynamically compiled, object-oriented intermediate code, such as
the Java bytecode, can outperform traditional systems,
because of the many compiler optimizations (i) that are only
possible at a late time (e.g., inlining virtual calls) and (ii)
that can be applied only when the system environment is
exactly known (e.g., cache optimizations [12]).

Using Java as the foundation of an operating system is
attractive, because of its widespread use and features, such
as interfaces, encapsulation of state, and automatic memory
management, that raise the level of abstraction and help to
build more robust software in less time.

To the best of our knowledge JX is the first Java
operating system that has all of the following properties:

* The amount of C and assembler code is minimal to
  simplify the system and make it more robust.

* Operating system code and application code are separated
  in protection domains with strong isolation between the
  domains.

* The code is structured into components, which can be
  collocated in a single protection domain or placed in
  separate domains without touching the component code. This
  reusability across configurations makes it possible to adapt
  the system for its intended use, which may be, for example,
  an embedded system, desktop workstation, or server.

* Performance is in the 50% range of monolithic UNIX
  performance for computationally intensive OS operations.
  The difference becomes even smaller when I/O
  from a real device is involved.

Besides describing the JX system, the contribution of
this paper consists of the first performance comparison
between a Java OS and a traditional UNIX OS using real
OS operations. We analyze two costs: (i) the cost of using
a type-safe language, like Java, as an OS implementation
language and (ii) the cost of extensibility.

The paper is structured as follows: In Section 2 we 
describe the architecture of the JX system and illustrate 
the cost of several features using micro benchmarks. Sec- 
tion 3 describes two application scenarios and their per- 
formance: a file system and an NFS server. Section 4 
describes tuning and configuration options to refine the 
system and measures their effect on the performance of 
the file system. Section 5 concludes and gives directions 
for future research. 


2 JX System Architecture 


The majority of the JX system is written in Java. A
small microkernel, written in C and assembler, contains
the functionality that cannot be provided at the Java level
(system initialization after boot up, saving and restoring
CPU state, low-level protection-domain management,
and monitoring).

Figure 1 shows the overall structure of JX. The Java
code is organized in components (Sec. 2.4) which are
loaded into domains (Sec. 2.1), verified (Sec. 2.6), and
translated to native code (Sec. 2.6). Domains encapsulate
objects and threads. Communication between domains is
handled by using portals (Sec. 2.2).

Figure 1: Structure of the JX system

The microkernel runs without any protection and
therefore must be trusted. Furthermore, a few Java
components must also be trusted: the code verifier, the code
translator, and some hardware-dependent components
(Sec. 2.7). These elements are the minimal trusted
computing base [19] of our architecture.


2.1 Domains 


The unit of protection and resource management is 
called a domain. All domains, except DomainZero, con- 
tain 100% Java code. 

DomainZero contains all the native code of the JX
microkernel. It is the only domain that cannot be
terminated. Domains interact with DomainZero in two ways.
First, explicitly by invoking services that
are provided by DomainZero. One of these services is a
simple name service, which can be used by other
domains to export their services by name. Second,
implicitly by requesting support from the Java runtime
system; for example, to allocate an object or check a
downcast.

Every domain has its own heap with its own garbage 
collector (GC). The collectors run independently and 
they can use different GC algorithms. Currently, domains 
can choose from two GC implementations: an exact, 
copying, non-generational GC or a compacting GC. 

Every domain has its own threads. A thread does not
migrate between domains during inter-domain
communication. Memory for the thread control blocks and
stacks is allocated from the domain’s memory area.

Domains are allowed to share code (classes and
interfaces) with other domains. But each domain has its
own set of static fields, which, for example, allows each
domain to have its own System.out stream.


2.2 Portals


Portals are the fundamental inter-domain communication
mechanism. The portal mechanism works similarly
to Java’s RMI [43], making it easy for a Java programmer
to use it. A portal can be thought of as a proxy for an
object that resides in another domain and is accessed
using remote procedure call (RPC).

An entity that may be accessed from another domain
is called a service. A service consists of a normal object,
which must implement a portal interface, an associated
service thread, and an initial portal. A service is accessed
via a portal, which is a remote (proxy) reference. Portals
are capabilities [18] that can be copied between domains.
The service holds a reference counter, which is
incremented each time the portal is duplicated. A domain that
wants to offer a service to other domains can register the
service’s portal at a name server.

When a thread invokes a method at a portal, the
thread is blocked and execution is continued in the
service thread. All parameters are deep copied to the target
domain. If a parameter is itself a portal, a duplicate of the
portal is created in the target domain.

Copying parameters poses several problems. It leads
to duplication of data, which is especially problematic
when a large transitive closure is copied. To prevent a
domain from being flooded with parameter objects, a per-call
quota for parameter data is used in JX. Another problem
is that the object identity is lost during copying.
Although parameter copying can be avoided in a single
address space system, and even for RPC between address
spaces by using shared communication buffers [7], we
believe that the advantages of copying outweigh its
disadvantages. The essential advantage of copying is a
nearly complete isolation of the two communicating
protection domains. The only time two domains can
interfere with each other is during portal invocation. This
makes it easy to control the security of the system and to
restrict information flow. Another advantage of the
copying semantics is that it can be extended to a distributed
system without much effort.

In practice, copying posed no severe performance 
problems, because only small data objects are used as 
parameters. Objects with a large transitive closure in 
most cases are server objects and are accessed using por- 
tals. Using them as data objects often is not intended by 
the programmer. 

As an optimization the system checks whether the 
target domain of a portal call is identical to the current 
domain and executes the call as a function invocation 
without thread switch and parameter copy. 
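
The copy semantics and the same-domain shortcut can be illustrated with a small model. It is written in Python for brevity and is not JX code: the names `Domain`, `PortalProxy` and `Appender` are hypothetical, and the thread switch, the per-call quota and the duplication of portal parameters are not modeled.

```python
import copy


class Domain:
    def __init__(self, name):
        self.name = name


class PortalProxy:
    """Proxy for a service object that lives in the domain `home`."""

    def __init__(self, target, home):
        self.target = target
        self.home = home

    def invoke(self, caller, method, *args):
        fn = getattr(self.target, method)
        if caller is self.home:
            # Same domain: plain function call, parameters not copied.
            return fn(*args)
        # Cross-domain call: the callee only sees deep copies.
        return fn(*copy.deepcopy(args))


class Appender:
    """Example service that mutates its parameter."""

    def add(self, items):
        items.append("x")
        return len(items)
```

When the caller is in another domain, the callee changes only its private copy of the list, so the caller's data stays untouched; a call from the home domain behaves like an ordinary method call and the mutation is visible.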

When a portal is passed as a parameter in a portal
call, it is passed by-reference. As a convenience to the
programmer the system also allows an object that
implements a portal interface to be passed like a portal. First it
is checked whether this object already is associated with
a service. In this case, the existing portal is passed.
Otherwise, a service is launched by creating the appropriate
data structures and starting a new service thread. This
mechanism allows the programmer to completely ignore
the issue of whether the call is crossing a domain border
or not. When the call remains inside the domain the
object is passed as a normal object reference. When the
call leaves the domain, the object automatically is
promoted to a service and a portal to this service is passed.

When a portal is passed to the domain in which its
service resides, a reference to the service object is passed
instead of the portal.

When two domains want to communicate via portals
they must share some types. These are at least the portal
interface and the parameter types. When a domain
obtains a portal, it is checked whether the correct
interface is present.

System                                         IPC (cycles)
L4Ka (PIII, 450 MHz) [32]                               818
Fiasco/L4 (PII 450 MHz) [42]                           2610
J-Kernel (LRMI on MS-VM, PPro 200 MHz) [28]             440
Alta/KaffeOS (PII 300 MHz) [5]                        27270
JX (PIII 500 MHz)                                       650

Table 1: IPC latency (round-trip, no parameters)

Each time a new portal to a service is created a
reference counter in the service control block is incremented.
It is decremented when a portal is collected as garbage or
when the portal’s domain terminates. When the count
reaches zero the service is deactivated and all associated
resources, such as the service thread, are released.

Table 1 shows the cost of a portal invocation and
compares it with other systems. This table contains very 
different systems with very different IPC mechanisms 
and semantics. The J-Kernel IPC, for example, does not 
even include a thread switch. 


Fast portals. Several portals which are exported by 
DomainZero are fast portals. A fast portal invocation 
looks like a normal portal invocation but is executed in 
the caller context (the caller thread) by using a function 
call - or even by inlining the code (see also Sec. 4.2.2). 
This is generally faster than a normal portal call, and in 
some cases it is even necessary. For example, DomainZero
provides a portal with which the current thread can
yield the processor. It would make no sense to implement 
this method using the normal portal invocation mecha- 
nism. 


2.3 Memory objects 


An operating system needs an abstraction to repre- 
sent large amounts of memory. Java provides byte arrays 
for this purpose. However, arrays have several shortcomings
that make them nearly unsuitable for our purposes.
They are not accessed using methods and thus the set of 
allowed operations is fixed. It is, for example, not possi- 
ble to restrict access to a memory region to a read-only 
interface. Furthermore, arrays do not allow revocation 
and subrange creation - two operations that are essential 
to pass large memory chunks without copying. 

To overcome these shortcomings we developed 
another abstraction to represent memory ranges: memory 
objects. Memory objects are accessed like normal 
objects via method invocations. But such invocations are 
treated specially by the translator: they are replaced by 
the machine instructions for the memory access. This 
makes memory access as fast as array access. 

Memory objects can be passed between domains like
portals. The memory that is represented by a memory
object is not copied when the memory object is passed to 
another domain. This way, memory objects implement 
shared memory. 

Access to a memory range can be revoked. For this 
purpose all memory portals that represent the same range 
of memory contain a reference to the same central data 
structure in DomainZero. Among other information this 
data structure contains a valid flag. The revocation 
method invalidates the original memory object by clear- 
ing the valid flag and returns a new one that represents 
the same range of memory. Memory is not copied during 
revocation but all memory portals that previously repre- 
sented this memory become invalid. 
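
The revocation scheme can be illustrated with a minimal Java sketch. All names here (RangeControl, MemoryHandle, share, revoke, get8, set8) are hypothetical and chosen for illustration; this is not the JX Memory interface, and in JX the translator replaces such method calls by machine instructions. The sketch only shows the idea that all handles to a range share one control structure with a valid flag, and that revocation clears the old flag and hands out a handle with a fresh flag over the same bytes.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for the central per-range data structure in DomainZero. */
final class RangeControl {
    final byte[] backing;                                  // the memory itself, never copied
    final AtomicBoolean valid = new AtomicBoolean(true);   // the shared valid flag

    RangeControl(byte[] backing) { this.backing = backing; }
}

/** Hypothetical memory object: every access checks the shared valid flag first. */
final class MemoryHandle {
    private final RangeControl control;

    MemoryHandle(int size) { this(new RangeControl(new byte[size])); }

    private MemoryHandle(RangeControl control) { this.control = control; }

    /** Models passing the object to another domain: new handle, same control structure. */
    MemoryHandle share() { return new MemoryHandle(control); }

    byte get8(int offset) { check(offset); return control.backing[offset]; }

    void set8(int offset, byte value) { check(offset); control.backing[offset] = value; }

    boolean isValid() { return control.valid.get(); }

    /** Clears the old flag, so all old handles fail, and returns a handle with a fresh flag. */
    MemoryHandle revoke() {
        if (!control.valid.compareAndSet(true, false)) {
            throw new IllegalStateException("memory object already revoked");
        }
        return new MemoryHandle(new RangeControl(control.backing)); // same bytes, no copy
    }

    private void check(int offset) {
        if (!control.valid.get()) throw new IllegalStateException("memory object revoked");
        if (offset < 0 || offset >= control.backing.length) {
            throw new IndexOutOfBoundsException("offset " + offset);
        }
    }
}
```

In this sketch a domain that wants exclusive access again calls revoke() and continues with the returned handle, while every handle it passed out earlier fails on its next access.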

When a memory object is passed to another domain, 
a reference counter, which is maintained for every mem- 
ory range, is incremented. When a memory object - 
which, in fact, is a portal or proxy for the real memory - 
is garbage collected, the reference counter is decremented.
This also happens for all memory objects of a
domain that is terminated. To correct the reference 
counts the heap must be scanned for memory objects 
before it is released. 


ReadOnlyMemory. ReadOnlyMemory is equivalent to 
Memory but it lacks all the methods that modify the 
memory. A ReadOnlyMemory object can not be con- 
verted to a Memory object. 


DeviceMemory. DeviceMemory is different from Mem- 
ory in that it is not backed by main memory: It is usually 
used to access the registers of a device or to access mem- 
ory that is located on a device and mapped into the CPU’s 
address space. The translator knows about this special 
use and does not reorder accesses to a DeviceMemory. 
When a DeviceMemory is garbage collected the memory 
is not released. 


2.4 Components 


All Java code that is loaded into a domain is organized
in components. A component contains the classes,
interfaces, and additional information; for example,
about dependencies on other components or about the
required scheduling environment (preemptive,
nonpreemptive).


Reusability. An overall objective of object orientation
and object-oriented operating systems is code reuse. JX 
has all the reusability benefits that come with object
orientation. But there is an additional problem in an
operating system: the protection boundary. In most
operating systems, calling a module across a protection
boundary is different from calling a module inside one’s
own protection domain. Because this difference is a big
hindrance on the way to reusability, this problem has
already been investigated in the microkernel context [22].

Our goal was a reuse of components in different con- 
figurations without code modifications. Although the 
portal mechanism was designed with this goal the pro- 
grammer must keep several points in mind when using a 
portal. Depending on whether the called service is 
located inside the domain or in another domain there are 
a few differences in behavior. Inside a domain normal 
objects are passed by reference. When a domain border 
is crossed, parameters are passed by copy. To write code 
that works in both settings the programmer must not rely 
on either of these semantics. For example, a programmer
relies on the reference semantics when modifying the 
parameter object to return information to the caller; and 
the programmer relies on the copy semantics when mod- 
ifying the parameter object assuming this modification 
does not affect the caller. 
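
The difference between the two semantics can be made concrete with a small Java sketch. The classes Request, StatusService, and PortalSemanticsDemo are hypothetical, and the cross-domain call is only simulated by copying the parameter before the call; no JX portal machinery is involved.

```java
/** Hypothetical parameter object, used only for illustration. */
final class Request {
    int status;

    Request(int status) { this.status = status; }

    Request copy() { return new Request(status); }
}

/** Hypothetical service with one fragile and one portable method. */
final class StatusService {
    /** Fragile: reports its result by modifying the parameter object. */
    void processInPlace(Request r) { r.status = 1; }

    /** Portable: reports its result through the return value only. */
    int process(Request r) { return r.status + 1; }
}

final class PortalSemanticsDemo {
    /** Call inside one domain: the parameter object is passed by reference. */
    static void callInsideDomain(StatusService s, Request r) { s.processInPlace(r); }

    /** Call across a domain border, simulated here by copying the parameter first. */
    static void callAcrossDomains(StatusService s, Request r) { s.processInPlace(r.copy()); }
}
```

With processInPlace the caller sees the modification only in the same-domain case, so the code behaves differently in the two configurations. The process method returns the same result in both, which is the style a reusable component has to follow.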

In practice, these problems can be relieved to a cer- 
tain extent by the automatic promotion of portal-capable 
objects to services as described in Section 2.2. By declar- 
ing all objects that are entry points into a component as 
portals a reference semantics is guaranteed for these 
objects. 


Dependencies. Components may depend on other com- 
ponents. We say that component B has an implementa- 
tion dependence on component A, if the method imple- 
mentations of B use classes or interfaces from A. Com- 
ponent B has an interface dependence on component A 
if the method signatures of B use classes or interfaces 
from A, or if a class/interface of B is a
subclass/subinterface of a class/interface of A, or if a
class of B implements an interface from A, or if a
non-private field of a class of B has as its type a
class/interface from A.

Component dependencies must be non-cyclic. This 
requirement makes it more difficult to split existing 
applications into components (although they can be
used as one component). A cyclic dependency between
components usually is a sign of bad design and should be 
removed anyway. When a cyclic dependency is present, 
it must be broken by changing the implementation of one 
component to use an interface from an unrelated compo- 
nent while the other class implements this interface. The 
components then both depend on the unrelated compo- 
nent but not on each other. The dependency check is per- 
formed by the verifier and translator. 

We used Sun’s JRE 1.3.1_02 for Linux to obtain the 
transitive closure of the depends-on relation starting with 
java.lang.Object. The implementation dependency con- 
sists of 625 classes; the interface dependency consists of 
25 classes. This means that each component that uses
the Object class (i.e., every component) depends on at
least 25 classes from the JDK. We think that even 25
classes are too broad a foundation for OS components,
and we therefore define a compatibility relation that
allows components to be exchanged.


Compatibility. The whole system is built out of components.
It is necessary to be able to improve and extend
one component without changing all components that
depend on this component. Only a component B that is
compatible with component A can be substituted for A. A
component B is binary compatible with a component A if
• for each class/interface C_A of A there is a corresponding
  class/interface C_B in component B, and
• class/interface C_B is binary compatible with class/interface
  C_A according to the definition given in the “Java Language
  Specification” [26], Chapter 13.

When a binary compatible component is also a 
semantic superset of the original component, it can be 
substituted for the original component without affecting 
the functionality of the system. 


JDK. The JDK is implemented as a normal component. 
Different implementations and versions can be used. 
Some classes of the JDK must access information that is
only available in the runtime system. The class Class is
an example. This information is obtained by using a por- 
tal to DomainZero. In other words, where a traditional 
JDK implementation would use a native method, JX uses 
a normal method that invokes a service of DomainZero 
via a portal. All of our current components use a JDK 
implementation that is a subset of a full JDK and, there- 
fore, can also be used in a domain that loads a full JDK. 


Interface invocation. Non-cyclic dependencies and the
compilation of whole components open up a way to
compile very efficient interface invocations. Usually,
interface invocations are a problem because it is not
possible to use a fixed index into a method table to find the
interface method. When different classes implement the
interface, the method can be at different positions in their
method tables. There exists some work to reduce the
overhead in a system that does not impose our
restrictions [1]. In our translator we use an approach that is
similar to selector coloring [20]. It makes interface
invocations as fast as method invocations at the cost of
(considerably) larger method tables.

The size of the x86 machine code in the complete JX 
system is 1,010,752 bytes, which was translated from 
230,421 bytes of bytecode. The method tables consume 
630,388 bytes. These numbers show that it would be 
worthwhile to use a compression technique for the 
method tables or a completely different interface invocation
mechanism. One should keep in mind that a technique
as described in [1] has an average-case performance
near to a virtual invocation, but it may be difficult
to analyze the worst-case behavior of the resulting
system, because of the use of a caching data structure.
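
The general idea behind selector coloring can be sketched in Java as follows. This is a simplified greedy coloring written for illustration, not the algorithm used in the JX translator; the class SelectorColoring and its methods are hypothetical. Two interface methods (selectors) conflict if some class implements both. Each selector receives a table index (color) that differs from the colors of all conflicting selectors, so an interface invocation can use a fixed index in every class.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class SelectorColoring {
    /**
     * Assigns a table index to each selector such that two selectors that are
     * implemented by the same class never share an index.
     * The map goes from class name to the selectors that class implements.
     */
    static Map<String, Integer> color(Map<String, Set<String>> classes) {
        // Conflict graph: selectors conflict if some class implements both.
        Map<String, Set<String>> conflicts = new TreeMap<>();
        for (Set<String> selectors : classes.values()) {
            for (String s : selectors) {
                conflicts.computeIfAbsent(s, k -> new TreeSet<>()).addAll(selectors);
            }
        }
        // Greedy coloring: take the smallest index not used by a colored neighbor.
        Map<String, Integer> colors = new TreeMap<>();
        for (String s : conflicts.keySet()) {
            Set<Integer> used = new HashSet<>();
            for (String other : conflicts.get(s)) {
                Integer c = colors.get(other);
                if (c != null) used.add(c);
            }
            int color = 0;
            while (used.contains(color)) color++;
            colors.put(s, color);
        }
        return colors;
    }

    /** Builds the table of one class; slots for unimplemented selectors stay null. */
    static String[] buildTable(String className, Set<String> selectors,
                               Map<String, Integer> colors) {
        int size = 0;
        for (int c : colors.values()) size = Math.max(size, c + 1);
        String[] table = new String[size];
        for (String s : selectors) table[colors.get(s)] = className + "." + s;
        return table;
    }
}
```

In this sketch every table has as many slots as there are colors, whether or not the class implements the corresponding selectors. The empty slots are the space cost that the numbers above reflect.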


2.5 Memory management 


Protection is based on the use of a type-safe lan- 
guage. Thus an MMU is not necessary. The whole sys- 
tem, including all applications, runs in one physical 
address space. This makes the system ideally suited for 
small devices that lack an MMU. But it also leads to sev- 
eral problems. In a traditional system fragmentation is 
not an issue for the user-level memory allocator, because 
allocated but unused memory is paged to disk. In JX
unused memory is wasted main memory. So we face a
problem similar to that of kernel memory allocators in UNIX,
where kernel memory usually also is not paged and 
therefore limited. In UNIX a kernel memory allocator is 
used for vnodes, proc structures, and other small objects. 
In contrast to this the JX kernel does not create many 
small objects. It allocates memory for a domain’s heap 
and the small objects live in the heap. The heap is man- 
aged by a garbage collector. In other words, the JX mem- 
ory management has two levels, a global management, 
which must cope with large objects and avoid fragmen- 
tation, and a domain-local garbage-collected memory. 
The global memory is managed using a bitmap allocator 
[46]. This allocator was easy to implement, it automatically
joins free areas, and it has a very low memory footprint:
Using 1024-byte blocks and managing about
128MBytes or 116977 blocks, the overhead is only
14622 bytes or 15 blocks or 0.01 percent. However, it
should not be too complicated to use a different allocator.
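
A bitmap allocator of this kind can be sketched in a few lines of Java. The class BitmapAllocator and its first-fit search are assumptions made for illustration; the JX allocator itself is part of the kernel and is not shown in this paper. One bit per block records whether the block is in use, which is where the low overhead comes from: one bit per 1024-byte block is 1/8192 of the managed memory, or roughly 0.01 percent.

```java
import java.util.BitSet;

/** Hypothetical bitmap allocator: one bit per fixed-size block, set means in use. */
final class BitmapAllocator {
    private final BitSet used;
    private final int blockCount;

    BitmapAllocator(int blockCount) {
        this.blockCount = blockCount;
        this.used = new BitSet(blockCount);
    }

    /** First-fit search for a run of free blocks; returns the first block or -1. */
    int allocate(int blocks) {
        if (blocks <= 0) throw new IllegalArgumentException("blocks must be positive");
        int start = used.nextClearBit(0);
        while (start + blocks <= blockCount) {
            int nextUsed = used.nextSetBit(start);
            if (nextUsed < 0 || nextUsed >= start + blocks) {
                used.set(start, start + blocks);   // mark the whole run as used
                return start;
            }
            start = used.nextClearBit(nextUsed);   // skip past the used run
        }
        return -1;
    }

    /** Clears the bits of a run; neighboring free runs merge without extra work. */
    void free(int start, int blocks) { used.clear(start, start + blocks); }

    int freeBlocks() { return blockCount - used.cardinality(); }
}
```

Because the bitmap stores no run lengths, the caller has to remember the size of each allocation and pass it to free. Joining free areas needs no code at all: two adjacent cleared runs are simply one longer run of zero bits.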

Giving up the MMU means that several of its
responsibilities (besides protection) must be implemented
in software. One example is stack overflow
detection, another is null pointer detection. Stack
overflow detection is implemented in JX by inserting a
stack size check at the beginning of each method. This is
feasible, because the required size of a stack frame is 
known before the method is executed. The size check has 
a reserve, in case the Java method must trap to a runtime 
function in DomainZero, such as checkcast. The null 
pointer check currently is implemented using the debug 
system of the Pentium processor. It can be programmed 
to raise an exception when data or code at address zero is
accessed. On architectures that do not provide such a fea- 
ture, the compiler inserts a null-pointer check before a 
reference is used. 

A domain has two memory areas: an area where 
objects may be moved and an area where they are fixed. 
In the future, a single area may suffice, but then all data 
structures that are used by a domain must be movable. 
Currently, the fixed area contains the code and class 
information, the thread control blocks and stacks. Moving
these objects requires an extension of the system: all
pointers to these objects must be known to the GC and 
updated; for example, when moving a stack, the frame 
pointers must be adjusted. 


2.6 Verifier and Translator 


The verifier is an important part of JX. All code is 
verified before it is translated to native code and exe- 
cuted. The verifier first performs a standard bytecode 
verification [48]. It then verifies an upper limit for the 
execution times of the interrupt handlers and the sched- 
uler methods (Sec. 2.8) [2]. 

The translator is responsible for translating bytecode 
to machine code, which in our current system is x86 
code. Machine code can be allocated either in the
domain’s fixed memory or in DomainZero’s fixed memory.
Installing it in DomainZero allows the code to be shared
between domains.


2.7 Device Drivers


An investigation of the Linux kernel has shown that 
most bugs are found in device drivers [13]. Because 
device drivers will profit most from being written in a 
type-safe language, all JX device drivers are written in
Java. They use DeviceMemory to access the registers of a 
device and the memory that is available on a device; for 
example, a frame buffer. On some architectures there are 
special instructions to access the I/O bus; for example, 
the in and out processor instructions of the x86. These 
instructions are available via a fast portal of DomainZero.
As with other fast portals, these invocations can be
inlined by the translator.


DMA. Most drivers for high-throughput devices will use
busmaster DMA to transfer data. These drivers, or at 
least the part that accesses the DMA hardware, must be 
trusted. 


Interrupts. Using a portal of DomainZero, device drivers
can register an object that contains a handleInterrupt
method. An interrupt is handled by invoking the
handleInterrupt method of the previously installed interrupt
handler object. The method is executed in a dedicated
thread while interrupts on the interrupted CPU are
disabled. This would be called a first-level interrupt handler
in a conventional operating system. To guarantee that the
handler cannot block the system forever, the verifier
checks all classes that implement the InterruptHandler
interface. It guarantees that the handleInterrupt method
does not exceed a certain time limit. To avoid undecidable
problems, only a simple code structure is allowed
(linear code, loops with constant bound and no write
access to the loop variable inside the loop). A
handleInterrupt method usually acknowledges the interrupt at the
device and unblocks a thread that handles the interrupt
asynchronously.


General Track: 2002 USENIX Annual Technical Conference

We do not allow device drivers to disable interrupts 
outside the interrupt handler. Drivers usually disable 
interrupts as a cheap way to avoid race conditions with
the interrupt handler. Code that runs with interrupts dis- 
abled in a UNIX kernel is not allowed to block, as this 
would result in a deadlock. Using locks also is not an 
option, because the interrupt handler - running with 
interrupts disabled - should not block. We use the 
abstraction of an AtomicVariable to solve these problems. 
An AtomicVariable contains a value that can be changed
and accessed using set and get methods. Furthermore, it
provides a method to atomically compare its value with
a parameter and block if the values are equal. Another
method atomically sets the value and unblocks a thread.
To guarantee atomicity the implementation of AtomicVariable
currently disables interrupts on a uniprocessor
and uses spinlocks on a multiprocessor. Using AtomicVariables
we implemented, for example, a producer/consumer
list for the network protocol stack.
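
The behavior of such a variable can be approximated in standard Java. The following sketch is an assumption-laden stand-in, not the JX implementation: it uses a Java monitor where JX disables interrupts or uses spinlocks, the method names blockIfEqual and atomicUpdateUnblock are hypothetical, and because Java's wait() may wake up spuriously the sketch blocks for as long as the values are equal.

```java
import java.util.Objects;

/** Hypothetical approximation of an AtomicVariable using a Java monitor. */
final class AtomicVariableSketch<T> {
    private T value;

    synchronized void set(T newValue) { value = newValue; }

    synchronized T get() { return value; }

    /** Atomically compares the value with the parameter and blocks while they are equal. */
    synchronized void blockIfEqual(T expected) throws InterruptedException {
        while (Objects.equals(value, expected)) {
            wait();   // releases the monitor while blocked
        }
    }

    /** Atomically sets the value and wakes up threads blocked in blockIfEqual. */
    synchronized void atomicUpdateUnblock(T newValue) {
        value = newValue;
        notifyAll();
    }
}
```

Because comparison and blocking happen under one monitor, a consumer cannot miss an update that arrives between its check and its decision to block, which is the race the text describes for drivers and interrupt handlers.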


2.8 Scheduling 


It is a common experience that the scheduler has
a large impact on the system’s performance. On the other
hand, no single scheduler is perfect for all applications.
Instead of providing a configuration interface to the
scheduler we follow our methodology of allowing a user
to completely replace an implementation, in this case the
scheduler. Each domain may also provide its own scheduler,
optimized for its particular requirements.

The scheduler can be used in several configurations: 

• First, there is a scheduler that is built into the kernel.
  This scheduler is only used for performance analysis,
  because it is written in C and cannot be replaced at run
  time.

• The kernel can be compiled without the built-in scheduler.
  Then all scheduling decisions lead to the invocation
  of a scheduler implementation which is written in
  Java. In this configuration there is one (Java) scheduler
  that schedules all threads of all domains.

• The most common configuration, however, is two-level
  scheduling. The global scheduler does not schedule
  threads, as in the previous configuration, but
  domains. Instead of activating an application thread, it
  activates the scheduler thread of a domain. This
  domain-local scheduler is responsible for selecting the
  next application thread to run. The global scheduler
  knows all domain-local schedulers and a domain-local
  scheduler has a portal to the global scheduler. On a
  multiprocessor there is one global scheduler per processor
  and the domains possess a reference to the global
  schedulers of the processors on which they are allowed
  to run.

The global scheduler must be trusted by all domains. 
The global scheduler does not need to trust a domain- 
local scheduler. This means that the global scheduler
cannot assume that an invocation of the local scheduler
returns after a certain time.

To prevent one domain monopolizing the processor, 
the computation can be interrupted by a timer interrupt. 
The timer interrupt leads to the invocation of the global 
scheduler. This scheduler first informs the scheduler of 
the interrupted domain about the pre-emption. It 
switches to the domain scheduler thread and invokes the 
scheduler’s method preempted(). During the execution 
of this method the interrupts are disabled. An upper 
bound for the execution time of this method has been
verified during the verification phase. When the method
preempted() returns, the system switches back to the thread
of the global scheduler. The global scheduler then
decides which domain to run next and activates the
domain-local scheduler using the method activated(). For each
CPU that can be used by a domain the local scheduler of 
the domain has a CPU portal. It activates the next runna- 
ble thread by calling the method switchTo() at the CPU 
portal. The switchTo() method can only be called by a 
thread that runs on the CPU which is represented by the 
CPU portal. The global scheduler does not need to wait 
for the method activated() to finish. Thus, an upper time 
bound for method activated() is not necessary. This 
method makes the scheduling decision and it can be arbi- 
trarily complex. 

If a local scheduler needs smaller time-slices than the 
global scheduler, the local scheduler must be interrupted 
without being pre-empted. For this purpose, the local 
scheduler has a method interrupted() which is called
before the time-slice is fully consumed. This method
operates similarly to the method activated().
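
The interplay of the two scheduling levels can be traced with a small single-threaded Java simulation. The method names preempted(), activated(), and switchTo() come from the text; their signatures, the interfaces CpuPortal and DomainScheduler, the round-robin policies, and the use of thread names instead of real threads are all assumptions made for this sketch. The interrupted() callback and the verified time bound of preempted() are not modeled.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Hypothetical CPU portal: the local scheduler uses it to run a thread. */
interface CpuPortal {
    void switchTo(String thread);
}

/** Callbacks of a domain-local scheduler; the signatures are assumptions. */
interface DomainScheduler {
    void preempted();                // informed that its domain lost the CPU
    void activated(CpuPortal cpu);   // makes the domain-local scheduling decision
}

/** Round-robin domain-local scheduler (an illustrative policy). */
final class RoundRobinDomainScheduler implements DomainScheduler {
    private final Deque<String> runnable = new ArrayDeque<>();
    private String current;

    RoundRobinDomainScheduler(String... threads) {
        runnable.addAll(Arrays.asList(threads));
    }

    @Override public void preempted() {
        if (current != null) {       // put the interrupted thread at the end of the queue
            runnable.addLast(current);
            current = null;
        }
    }

    @Override public void activated(CpuPortal cpu) {
        current = runnable.pollFirst();
        if (current != null) cpu.switchTo(current);
    }
}

/** Global scheduler: schedules domains, not threads. */
final class GlobalScheduler {
    private final List<? extends DomainScheduler> domains;
    private final CpuPortal cpu;
    private int running = -1;        // index of the domain that currently owns the CPU

    GlobalScheduler(List<? extends DomainScheduler> domains, CpuPortal cpu) {
        this.domains = domains;
        this.cpu = cpu;
    }

    /** Models one timer interrupt: inform the old domain, then activate the next one. */
    void timerInterrupt() {
        if (running >= 0) domains.get(running).preempted();
        running = (running + 1) % domains.size();
        domains.get(running).activated(cpu);
    }
}
```

For two domains with threads a1, a2 and b1, four timer interrupts produce the switchTo() sequence a1, b1, a2, b1: the global scheduler alternates between the domains, and each domain decides on its own which of its threads runs.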

Because our scheduler is implemented outside the 
microkernel and there are operations of the microkernel 
that affect scheduling, for example, thread handoff dur- 
ing a portal invocation, we face a similar situation as a 
user-level thread implementation on a UNIX-like system.
A well-known solution is scheduler activations
[3], which notify the user-level scheduler about events
inside the kernel, such as I/O operations. JX uses a similar
approach, although there are very few scheduling-related
operations inside the kernel. Scheduling is
affected when a portal method is invoked. First, the
scheduler of the calling domain is informed that one
thread performs a portal call. The scheduler can now
delay the portal call, if there is any other runnable thread
in this domain. But it can as well hand off the processor
to the target domain. The scheduler of the service domain
is notified of the incoming portal call and can either
activate the service thread or let another thread of the domain
run. Not being forced to schedule the service thread
immediately is essential for the implementation of a
non-preemptive domain-local scheduler.

This extra communication is not free. The time of
a portal call increases from 650 cycles (see Table 1) to
920-960 cycles if either the calling domain or the called
domain is informed. If both involved domain schedulers
are informed about the portal call the required time
increases to 1180 cycles.


2.9 Locking and condition variables 


Kernel-level locking. There are very few data structures 
that must be protected by locks inside DomainZero. 
Some of them are accessed by only one domain and can 
be locked by a domain-specific lock. Others, for example,
the domain management data structures, need a global
lock. Because the access to this data is very short, an
implementation that disables interrupts on a uniprocessor
and uses spinlocks on a multiprocessor is sufficient.


Domain-level locking. Domains are responsible for 
synchronizing access to objects by their own threads. 
Because there are no objects shared between domains 
there is no need for inter-domain locking of objects. Java 
provides two facilities for thread synchronization: mutex 
locks and condition variables. When translating a component
to native code, an access to such a construct is
redirected to a user-supplied synchronization class. How
this class is implemented can be decided by the user. It 
can provide no locking at all or it can implement mutexes 
and condition variables by communicating with the 
(domain-local) scheduler. Every object can be used as a 
monitor (mutex lock), but very few actually are. To avoid 
allocating a monitor data structure for every object, tra- 
ditional JVMs either use a hashtable to go from the 
object reference to the monitor or use an additional 
pointer in the object header. The hashtable variant is slow 
and is rarely used in today’s JVMs. The additional 
pointer requires that the object layout must be changed 
and the object header be accessible to the locking system. 
Because the user can provide a custom implementation,
these two implementations, or a completely application- 
specific one, can be used. 
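
The hashtable variant can be sketched in standard Java as follows. The class HashtableMonitors and its methods are hypothetical; the interface that JX expects from a user-supplied synchronization class is not given in this paper, and condition variables are left out. The sketch only shows how a monitor is found from an object reference without reserving space in every object.

```java
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical synchronization class: monitors live in a table keyed by object identity. */
final class HashtableMonitors {
    private final Map<Object, ReentrantLock> monitors = new IdentityHashMap<>();

    /** Looks up the monitor of an object and creates it on first use. */
    private synchronized ReentrantLock monitorFor(Object o) {
        return monitors.computeIfAbsent(o, k -> new ReentrantLock());
    }

    private synchronized ReentrantLock existingMonitor(Object o) {
        return monitors.get(o);
    }

    /** Corresponds to entering a synchronized block on the object. */
    void enter(Object o) { monitorFor(o).lock(); }

    /** Corresponds to leaving a synchronized block on the object. */
    void exit(Object o) {
        ReentrantLock lock = existingMonitor(o);
        if (lock == null) throw new IllegalMonitorStateException("object was never locked");
        lock.unlock();
    }

    boolean isHeldByCurrentThread(Object o) {
        ReentrantLock lock = existingMonitor(o);
        return lock != null && lock.isHeldByCurrentThread();
    }

    synchronized int monitorCount() { return monitors.size(); }
}
```

Only objects that are actually locked get a monitor, at the price of a table lookup on every lock operation. This sketch never removes entries from the table; a real implementation would have to release the monitor of an object that is no longer in use.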


Inter-domain locking. Memory objects allow sharing 
of data between domains. JX provides no special inter- 
domain locking mechanisms. When two domains want to 
synchronize, they can use a portal call. We did not need 
such a feature yet, because the code that passes memory 
between domains does it by explicitly revoking access to 
the memory. 


3 Application Scenarios: Comparing JX to a Traditional Operating System


JX contains a file system component that is a port of the
Linux ext2 file system to Java [45]. Figure 2 shows the
configuration, where file system and buffer cache are
cleanly separated into different components. The gray
areas denote protection domains and the white boxes
components. The file system uses the BufferCache
interface to access disk blocks. To read and write blocks
to a disk the buffer cache implementation uses a reference
to a device that implements the BlockIO interface. The
file system and buffer cache components do not use
locking. They require a non-preemptive scheduler to be
installed in the domain.

To evaluate the performance of JX we used two
benchmarks: the IOZone benchmark [44] to assess file
system performance and a home-brewed rate benchmark
to assess the performance of the network stack and NFS
server. The rate benchmark sends getattr requests to the
NFS server as fast as possible and measures the achievable
request rate. As JX is a pure Java system, we cannot
use the original IOZone program, which is written in C.
Thus we ported IOZone to Java. The JX results were
obtained using our Java version and the Linux results
were obtained using the original IOZone.

Figure 2: IOZone configuration

The hardware consists of the following components:
• The system-under-test: a PIII 500MHz with 256 MBytes
  RAM and a 100 MBit/s 3C905B Ethernet card running
  Suse Linux 7.3 with kernel 2.4.0 or JX.
• The client for the NFS benchmark: a PIII 1GHz with a
  100 MBit/s 3C905B Ethernet card running Suse Linux
  7.3.
• A 100 MBit/s hub that connects the two systems.

Figure 3: Linux IOZone performance

Figure 3 shows the results of running the IOZone 
reread benchmark on Linux. 

Our Java port of the IOZone contains the write, 
rewrite, read, and reread parts of the original benchmark. 
In the following discussion we only use the reread part of 
the benchmark. The read benchmark measures the time 
to read a file by reading fixed-length records. The reread 
benchmark measures the time for a second read pass. 
When the file is smaller than the buffer cache all data 
comes from the cache. Once a disk access is involved,
disk and PCI bus data transfer times dominate the result 
and no conclusions about the performance of JX can be 
drawn. To avoid these effects we only use the reread 
benchmark with a maximum file size of 512 KBytes, 
which means that the file completely fits into the buffer 
cache. The JX numbers are the mean of 50 runs of
IOZone. The standard deviation was less than 3%. For 
time measurements on JX we used the Pentium times- 
tamp counter which has a resolution of 2 ns on our 
system. 

Figure 2 shows the configuration of the JX system 
when the IOZone benchmark is executed. Figure 4 shows 
the results of the benchmark. Figure 5 compares JX per- 
formance to the Linux performance. Most combinations 
of file size and record size give a performance between 
20% and 50% of the Linux performance. Linux is espe- 
cially good at reading a file using a small record size. The 
performance of this JX configuration is rather insensitive
to the record size. We will explain how we improved the 
performance of JX in the next section. 

Another benchmark is the rate benchmark, which
measures the achievable NFS request rate by sending
getattr requests to the NFS server. Figure 6 shows the
domain structure of the NFS server: all components are
placed in one domain, which is a typical configuration
for a dedicated NFS server. Figure 8 shows the results of
running the rate benchmark with a Linux NFS server
(both kernel and user-level NFS) and with a JX NFS
server. There are drops in the JX request rate that occur
periodically. To see what is going on in the JX NFS
server, we collected thread switch information and
created a thread activity diagram. Figure 7 shows this
diagram. We see an initialization phase which is completed
six seconds after startup. Shortly after startup a periodic
thread (ID 2.12) starts, which is the interrupt handler of
the real-time clock. But the important activity starts at
about 17 seconds. The CPU is switched between the
“IRQThread11”, “Etherpacket-Queue”, “NFSProc”, and
“Idle” threads. This is the activity during the rate
benchmark. Packets are received and put into a queue by the
first-level interrupt handler of the network interface,
“IRQThread11” (ID 2.14). This unblocks the
“Etherpacket-Queue” thread (ID 2.19), which processes the
packet and finally puts it into a UDP packet queue. This
unblocks the “NFSProc” thread (ID 2.27), which
processes the NFS packet and accesses the file system. This
is done in the same thread, because the NFS component
and the file system are collocated. Then a reply is sent
and all threads block, which wakes up the “Idle” thread
(ID 0.1). The sharp drops in the request rate of the JX
NFS server in Figure 8 correspond to the GC thread (ID
2.1), which runs for about 100 milliseconds without being
interrupted. It runs that long because neither the garbage
collector nor the NFS server is optimized. In particular,
the RPC layer creates many objects during RPC packet
processing. The GC is not interrupted, because it disables
interrupts as a safety precaution in the current
implementation. The pauses could be avoided by using an
incremental GC [6], which allows the GC thread to run
concurrently with threads that modify the heap.

Figure 7: Thread activity during the rate benchmark


4 Optimizations


JX provides a wide range of flexible configuration
options. Depending on the intended use of the system 
several features can be disabled to enhance performance. 

Figures 9 through 14 show the results of running the 
Java IOZone benchmark on JX with various configuration
options. These results are discussed in further detail
below. The legend for the figures indicates the specific 
configuration options used in each case. The default con- 
figuration used in Figure 3 was MNNSCR, which means 
that the configuration options used were multi-domain, 
no inlining, no inlined memory access, safety checks 
enabled, memory revocation check by disabling interrupts,
and a Java round-robin scheduler. At the end of this
section we will select the fastest configuration and repeat
the comparison to Linux.

The modifications described in this section are pure
configuration changes. Not a single line of code is modified.


4.1 Domain structure 


How the system is structured into domains determines
communication overheads and thus affects performance.
For maximal performance, components should be placed in
the same domain. This removes portal communication
overhead. Figure 9 shows the improvement gained by
placing all components into a single domain. The
performance improvement is especially visible when using
small record sizes, because then many invocations
between the IOZone component and the filesystem
component take place. The larger improvement in the 4KB
file size / 4KB record size case can be explained by the
fact that the overhead of a portal call is relatively
constant and the 4KB test is very fast, because it
operates completely in the L1 cache. So the portal call
time makes up a considerable part of the complete time.
The contrary is true for large file sizes: the absolute
throughput is lower due to processor cache misses, and
the time saved on portal calls is only a small fraction
of the complete time. Within one file size the effect also
becomes smaller with increasing record sizes. This can be
explained by the decreasing number of performed portal
calls.
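
The argument can be made concrete with a small model, offered only as
an illustration. It assumes one portal call per record with a fixed
cost, plus a transfer time proportional to the file size. The cost
values below are made-up numbers, not measurements from JX.

```python
def portal_share(file_size, record_size, call_overhead_us, us_per_byte):
    """Fraction of the total time spent in portal calls for one pass."""
    calls = file_size // record_size       # one portal call per record
    overhead = calls * call_overhead_us    # roughly constant cost per call
    transfer = file_size * us_per_byte     # higher per-byte cost on cache misses
    return overhead / (overhead + transfer)


KB = 1024

# Hypothetical costs: 1 us per portal call; cached bytes are cheap,
# uncached bytes are ten times more expensive.
small_file = portal_share(4 * KB, 4 * KB, 1.0, 0.0005)
large_file = portal_share(512 * KB, 4 * KB, 1.0, 0.005)
large_file_big_records = portal_share(512 * KB, 64 * KB, 1.0, 0.005)
```

Under these assumptions the portal share is largest for the small
cached file, smaller for the large file, and smaller still once the
record size grows, which matches the ordering described above.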


4.2 Translator configuration 


The translator performs several optimizations. This 
section investigates the performance impact of each of 
these optimizations. The optimizations are inlining, 
inlining of fast portals, and elimination of safety checks. 


4.2.1 Inlining


One of the most important optimizations in an
object-oriented system is inlining. We currently inline
only non-virtual methods (final, static, or private). We
plan to also inline virtual methods that are not
overridden, but this would require a recompilation when,
at a later time, a class that overrides the method is
loaded into the domain. Figure 10 shows the effect of
inlining.


4.2.2 Inlining of fast portals


A fast portal interface (see Sec. 2.2) that is known to
the translator can also be inlined. To be able to inline
these methods, which are written in C or assembler, the
translator must know their semantics. Since we did not
want to wire these semantics too deeply into the
translator, we developed a plugin architecture. A
translator plugin is responsible for translating the
invocations of the methods of a specific fast portal
interface. It can either generate special code or fall
back to the invocation of the DomainZero method.

We expected a considerable performance improvement, but
as can be seen in Figure 11 the difference is very small.
We assume that these are instruction cache effects: when
a memory access is inlined, the code is larger than the
code that is generated for a function call. This is due
to range checks and revocation checks that must be
emitted in front of each memory access.


4.2.3 Safety checks 


Safety checks, such as stack size checks and bounds
checks for arrays and memory objects, can be omitted on
a per-domain basis. Translating a domain without checks
is equivalent to the traditional OS approach of hoping
that the kernel contains no bugs. The system is then as
unsafe as a kernel that is written in C. Figure 12 shows
that switching off safety checks can give a performance
improvement of about 10 percent.


4.3 Memory revocation 


Portals and memory objects are the only objects that
can be shared between domains. They are capabilities,
and an important functionality of capabilities is
revocation. Portal revocation is implemented by checking
a flag before the portal method is invoked. This is an
inexpensive operation compared to the whole portal
invocation. Revocation of memory objects is more critical
because the operations of memory objects (reading and
writing the memory) are very fast and frequently used.
The situation is even more involved, because the check of
the revocation flag and the memory access have to be
performed as an atomic operation. JX can be configured to
use different implementations of this revocation check:

• NoCheck: No check at all, which means revocation is
not supported.

• CLI: Saves the interrupt-enable flag and disables
interrupts before the memory access and restores the
interrupt-enable flag afterwards.

• SPIN: In addition to disabling interrupts, a spinlock is
used to make the operation atomic on a multiprocessor.

• ATOMIC: The JX kernel contains a mechanism to
avoid locking altogether on a uniprocessor. The atomic
code is placed in a dedicated memory area. When the
low-level part of the interrupt system detects that an
interrupt occurred inside this range, the interrupted
thread is advanced to the end of the atomic procedure.
This technique is fast in the common case but incurs the
overhead of an additional range check of the instruction
pointer in the interrupt handler. It increases interrupt
latency when the interrupt occurred inside the atomic
procedure, because the procedure must first be finished.
But the most severe downside of this technique is that it
inhibits inlining of memory accesses. Similar techniques
are described in [9], [36], [35], [41].

[Figure legend, encoding of the measured configuration:
S (single domain), M (multi domain)]

Figure 14a: Simple Java Scheduler: MIFSNI vs. MIFSNR
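
The requirement that the flag check and the memory access form one
atomic step can be sketched as follows. This is not JX code: the
class and method names are hypothetical, and a Python lock stands in
for disabling interrupts (CLI) or taking a spinlock (SPIN).

```python
import threading


class RevokedError(Exception):
    """Raised when a revoked memory object is accessed."""


class MemoryObject:
    def __init__(self, size):
        self._data = bytearray(size)
        self._revoked = False
        # Stand-in for CLI or a spinlock: makes check-and-access atomic.
        self._guard = threading.Lock()

    def revoke(self):
        with self._guard:
            self._revoked = True

    def _check(self, offset):
        # Caller must hold the guard.
        if self._revoked:
            raise RevokedError("memory object was revoked")
        if not 0 <= offset < len(self._data):
            raise IndexError(offset)      # range check before every access

    def read(self, offset):
        with self._guard:
            self._check(offset)
            return self._data[offset]

    def write(self, offset, value):
        with self._guard:
            self._check(offset)
            self._data[offset] = value
```

Dropping the guard and the flag test corresponds to the NoCheck
variant: the access is cheaper, but a revocation can no longer be
enforced.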

Figure 13a shows the change in performance when no 
revocation checks are performed. This configuration is 
slightly slower than a configuration that used the CLI 
method for revocation check. We can only explain this by 
code cache effects. 

Using spinlocks adds further overhead (Figure 13b).
Despite some improvements in a former version of JX,
using atomic code could not improve the IOZone
performance of the measured system (Figure 13c).


4.4 Cost of the open scheduling framework 


Scheduling in JX can be accomplished with user- 
defined schedulers (see Sec. 2.8). The communication 
between the global scheduler and the domain schedulers 
is based on interfaces. Each domain scheduler must 
implement a certain interface if it wants to be informed 
about special events. If a scheduler does not need all the 
provided information, it does not implement the corre- 
sponding interface. This reduces the number of events 
that must be delivered during a portal call from the 
microkernel to the Java scheduler. 

In the configurations presented up to now we used a
simple round-robin scheduler (RR) in each domain. The
domain scheduler is informed about every event,
regardless of whether it is interested in it or not.
Figure 14a shows the benefit of using a scheduler which
implements only the interfaces needed for the round-robin
strategy (RR invisible portals) and is not informed when
a thread switch occurred due to a portal call.
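
The idea of delivering an event only to schedulers that implement the
corresponding interface can be sketched as follows. The interface and
class names are hypothetical and do not correspond to the actual JX
scheduler interfaces.

```python
class PortalSwitchListener:
    """Hypothetical interface: implemented only by schedulers that want
    to hear about thread switches caused by portal calls."""

    def on_portal_switch(self, thread):
        raise NotImplementedError


class VisiblePortalsScheduler(PortalSwitchListener):
    def __init__(self):
        self.events = 0

    def on_portal_switch(self, thread):
        self.events += 1          # an upcall into the scheduler per portal call


class InvisiblePortalsScheduler:
    """Does not implement PortalSwitchListener, so it receives no upcalls."""

    def __init__(self):
        self.events = 0


def portal_call(scheduler, thread):
    """Deliver the event only if the scheduler implements the interface."""
    if isinstance(scheduler, PortalSwitchListener):
        scheduler.on_portal_switch(thread)
        return True
    return False


visible = VisiblePortalsScheduler()
invisible = InvisiblePortalsScheduler()
for _ in range(3):
    portal_call(visible, "thread-1")
    portal_call(invisible, "thread-1")
```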

As already mentioned, there is a scheduler built into
the microkernel. This scheduler is implemented in C and
cannot be exchanged at run time. Therefore this type of
scheduling is mainly used during development or
performance analysis. The advantage of this scheduler is
that no calls to the Java level are necessary. Figure 14b
shows that there is no relevant difference in IOZone
performance between the core scheduler and the Java
scheduler with invisible portals.


4.5 Summary: Fastest safe configuration 


Having explained all the optimizations, we can now
again compare the performance of JX with that of Linux.
The most important optimizations are the use of a single
domain, inlining, and the use of the core scheduler or the
Java scheduler with invisible portals. We configured the
JX system to make revocation checks using CLI, use a
single domain, use the kernel scheduler, enable inlining,
and disable inlining of memory methods. With this
configuration we achieved between about 40% and 100% of
Linux performance (Figure 15). By disabling safety checks
we were even able to achieve between 50% and 120% of
Linux performance.

Figure 15: JX vs. Linux: Fastest configuration (SINSCC)

5 Related work


There are several areas of related work. The first two
areas are concerned with general principles of structuring
an operating system: extensibility and reusability
across system configurations. The other areas are
language-based operating systems and especially Java
operating systems.


Extensibility. With respect to extensibility JX is similar
to L4 [33], Pebble [25], and the Exokernel [24] in that it
tries to reduce the fixed, static part of the kernel. It is
different from systems like SPIN [8] and VINO [40],
because these systems only allow a gradual modification
of the system service, using spindles (SPIN) or grafts
(VINO). JX allows its complete replacement. This is
necessary in some cases and in most cases will give better
performance, because more suitable algorithms can be
used inside the service. A system service with an
extension interface will only work as long as the
extensions fit into a certain pattern that was envisioned
by the designer of the interface. A more radical change of
the service is not possible.

An important difference between JX and previous
extensible systems is that in JX the translator is part of
the operating system. This allows several optimizations,
as described in the paper.


Modularity and protection. Orthogonality between
modularity and protection was brought forward by Lipto
[22]. The OSF [15] attacked the specific problem of
collocating the OSF/1 UNIX server, which was run on top
of the Mach microkernel, with the microkernel. They
were able to achieve a performance only 8% slower than
a monolithic UNIX. The special case of code reuse
between the kernel and user environment was
investigated in the Rialto system [21]. Rialto uses two
interfaces, a very efficient one for collocated components
(for example the mbuf [34] interface) and another one
when a protection boundary must be crossed (the normal
read/write interface). We think that this hinders
reusability and complicates the implementation of
components, especially as there exist techniques to build
“unified” interfaces in MMU-based systems [23] and,
using our memory objects, also in language-based systems.

There is a considerable amount of work in single 
address space operating systems, such as Opal [11] and 
Mungi [29]. Most of these systems use hardware protec- 
tion, depend on the mechanisms that are provided by the 
hardware, and must structure the system accordingly, 
which makes their problems much different from ours. 


Language-based OS. Using a safe language as a
protection mechanism is an old idea. A famous early
system was Pilot [38], which used Mesa [30], a language
with a bytecode instruction set for a stack machine. Pilot
was not designed as a multiuser operating system. More
recent operating systems that use safe languages are
SPIN [8], which uses Modula-3, and Oberon [47], which
uses the Oberon language, a descendant of Modula-2.


Java OS. The first Java operating system was JavaOS
from Sun [39]. We do not know of any published
performance data for JavaOS, but because it used an
interpreter, we assume that it was rather slow.
Furthermore, it provided only a single protection domain.
This makes sense, because JavaOS was planned to be a
thin-client OS. However, besides JX, JavaOS is the only
system that tried to implement the complete OS
functionality in Java. JKernel [28], the MVM [16], and
KaffeOS [4] are systems that allow isolated applications
to run in a single JVM. These systems are not operating
systems, but contain several interesting ideas. JKernel is
a pure Java program and uses the name spaces that are
created by using different class loaders as a means of
isolation. JKernel concentrates on how to implement a
capability mechanism in pure Java. It relies on the JVM
and OS for resource management. The MVM is an
extension of Sun’s HotSpot JVM that allows running
many Java applications in one JVM and gives the
applications the illusion of having a JVM of their own. It
allows bytecode and JIT-compiled code to be shared
between applications, thus reducing startup time. There
are no means for resource control and no fast
communication mechanisms for applications inside one
MVM. KaffeOS is an extension of the Kaffe JVM.
KaffeOS uses a process abstraction that is similar to
UNIX, with kernel-mode code and user-mode code,
whereas JX is structured more like a multi-server
microkernel system. Communication between processes
in KaffeOS is done using a shared heap. Our goal was to
avoid sharing between domains as much as possible, and
we therefore use RPC for inter-domain communication.
Furthermore, KaffeOS is based on the Kaffe JVM, which
limits the overall performance and the amount of
performance optimizations that are possible in a
custom-built translator like ours.

These three systems do not have the robustness
advantages of a 100% Java OS, because they rely on a
traditional OS which is written in a low-level language,
usually C.


6 Conclusion and future work 


We described the JX operating system and its
performance. While it reaches about 50% to 100% of
Linux performance in a file system benchmark in a
monolithic configuration, the system can also be used in
a more flexible configuration with a slight performance
degradation.

To deliver on our promise of outperforming traditional,
UNIX-based operating systems, we have to further
improve the translator. The register allocation is still very
simple, which is especially unsatisfactory on a processor
with few registers, like the x86.

We plan to refine the memory objects. Several additional
memory semantics are possible. Examples are
copy-on-write memory, a memory object that represents
non-contiguous chunks of memory as one memory
object, or a memory object that does not allow
revocation. All these semantics can be implemented very
efficiently using compiler plugins. The current
implementation does not use an MMU because it does not
need one. MMU support can be added to the system to
expand the address space or implement copy-on-write
memory. How this complicates the architecture and its
implementation remains to be seen.
Abstract


File systems have (at least) two undesirable characteristics: both the addressing model and the consistency semantics
differ from those of memory, leading to a change in programming model at the storage boundary. Main memory is a
single flat space of pages with a simple durability (persistence) model: all or nothing. File content durability is a complex
function of implementation, caching, and timing. Memory is globally consistent. File systems offer no global consistency
model. Following a crash recovery, individual files may be lost or damaged, or may be collectively inconsistent even
though they are individually sound.

Single level stores offer an alternative approach in which the memory system is extended all the way down to the
disk level. This extension is accompanied by a transacted update mechanism that ensures globally consistent durability.
While single level stores are both simpler and potentially more efficient than a file system based design, relatively little
has appeared about them in the public literature. This paper describes the evolution of the EROS single level store design
across three generations. Two of these have been used; the third is partially implemented. We identify the critical design
requirements for a successful single level store, the complications that have arisen in each design, and the resolution of
these complications.

As the performance of the EROS system has been discussed elsewhere, this paper focuses exclusively on design. Our
objective is both to express clearly how a single level store works and to expose some non-obvious details in these designs
that have proven to be important in practice.


1 Introduction 


Single level stores simplify operating system design by
removing an unnecessary layer of abstraction from the
system. Instead of implementing new and different
semantics at the file system layer, a single level store
extends the memory mapping model downwards to include
the disk. Where conventional operating systems use the
memory mapping hardware to translate virtual page
addresses to physical pages, single level stores map
virtual page addresses to logical page addresses, using
physical memory as a software-managed cache to hold
these pages.


The most widely used single level store design is
probably the IBM System/38, more commonly known as the
AS/400 [IBM98]. At the hardware level, the AS/400 is a
capability-based object system. The AS/400 design treats
the entire store as a unified, 64-bit address space. Every
object is assigned a 16 megabyte segment within this
space. Persistence is managed explicitly: changes to
objects are rewritten to disk only when directed by the
application. While the protection architecture and object
structure of the AS/400 are described by Soltis [Sol96],
key details of its single-level store implementation are
unpublished.


Like the AS/400, EROS is a capability-based single level
store design. Unlike the AS/400, EROS manages persistence
transparently using an efficient, transacted checkpoint
system that runs periodically in the background.
Applications rely on the kernel to transparently handle
persistence, leaving the applications free to build data
structures and algorithms without regard to disk-level
placement or the need to protect recoverability through
careful disk write ordering. The EROS system and its
performance have been described elsewhere [SSF99]. This
paper describes how the EROS single level store design
is integrated into the system and its key components, and
the evolution of this design over the last decade.


* This research was supported by DARPA under contract #4N66001-96-


As an initial intuition for single level stores, imagine a
system design that begins by assuming that the machine
never crashes. In such a system, there would be no need
for a file system at all; the entirety of the disk is used as a
large paging area. Such a design would clearly eliminate a
large body of code from a conventional operating system.
The EROS system, including user-mode applications that
implement essential functions, is currently 103,712 lines
of code.¹ Excluding drivers, networking protocols, and
include files, the Linux 2.4 kernel contains 383,698 lines
of code, of which 283,956 implement support for various
file systems and file mapping. While the two systems
implement very different semantics, their functionality
is comparable, and the EROS code provides features
that Linux lacks: on restart, EROS recovers processes and
interprocess communication channels in addition to data
objects.


¹ EROS drivers and network stack are implemented outside the kernel.
Driver and network code size is not included in either estimate of
code size.


The design challenges of a single-level store are (1) to
address the problem that systems actually do crash, and
ensure that consistency is preserved when this occurs,
(2) to devise some efficient means of addressing this very
large paging area, and (3) to provide some means for
specifying desired locality, preferably in a fashion
informed by application-level semantic knowledge. This
paper describes three designs that meet these challenges
in two different EROS kernel designs. Other potential
applications of these design ideas include database
storage managers, storage-attached networks, and logical
volume systems.


The first design presented is the one used by KeyKOS,
which was inherited by the original EROS system in 1991.
This design suffered from minor irritations that caused us
to revise the design in 1997. In 2001, it was decided to
remove drivers from the EROS kernel entirely, which
forced us to rethink and partially rebuild the single level
store yet again. Aside from the storage allocator itself,
all of these designs present identical external semantics
to applications.


The balance of this paper proceeds as follows. We first
provide a brief overview of the EROS object system, its
storage model and the mechanism used to ensure global
consistency. This discussion introduces the critical
requirements that must be satisfied by a single-level store
design. We then describe the user-level storage allocator,
which bears responsibility for locality management and
storage reclamation. We then describe each of the
existing design generations in turn, and the motivation
behind each revision. The paper concludes with related
work, lessons learned, and some hints as to our future
plans.


2 Object System Overview 


EROS is a microkernel design. The kernel implements
a small number of object types, but leaves storage
allocation, fault handling, address space management, and
many other traditional kernel functions to user-level code.
For this paper, the most important function implemented
by user-level code is the space bank (Section 5). The
space bank is responsible for all storage allocation, for
storage quota enforcement, for bulk storage reclamation,
and for disk-level object placement. At the kernel
interface, all of this is accomplished by allocating and
deallocating objects with appropriately selected unique
object identifiers. Every object has a unique object
identifier (OID). OIDs directly correlate to disk locations,
which enables the space bank to perform object placement.


The EROS kernel design consists of three layers
(Figure 1). The machine layer stores process and memory
mapping information using a representation that is
convenient to the hardware. For process state, this
representation is determined by the design of the
hardware context switch mechanism. For memory mapping
state, it is determined by the design of the hardware
memory management unit. These layers are managed as a
cache of selected objects that logically reside in the
object cache. When necessary, entries in the process cache
or memory mapping tables are either written back or
invalidated. Entries in the process cache correspond to
entries in the process table of a more conventional design,
but processes may be moved in and out of the process
cache several times during their lifespan.















[Figure 1 diagram: Machine Layer (Process Cache, Mapping
Tables); Object Cache in main memory (Node Cache, Page
Cache); Object Store (Checkpoint Area, Home Locations)]

Figure 1: EROS design layers.


The object cache occupies the bulk of main memory. At
this layer there are only two types of objects: pages and
nodes. Pages hold user data. Nodes hold capabilities.
Every node and page has a corresponding location in the
home location portion of the object store. As with the
machine layer, the object cache is a software-managed
cache of the state on the disk.


As is implied by Figure 1, all higher-level operating
system abstractions are composed from these two
fundamental units of storage. Process state is stored in
nodes, and is loaded into the process cache at need.
Address space mappings are likewise represented using
trees of nodes. These are traversed to construct hardware
mapping data structures as memory faults occur. Details
of these transformations can be found in [SSF99, SFS96].


The object store layer is the object system as it exists on
the disk. At this layer the system is divided into two parts:
the “home locations,” which provide space for every
object in the system, and the “checkpoint area,” which
provides the means for building consistent snapshots of
the system. All object writes are performed to the
checkpoint area. Revised objects are migrated to their
home locations only after a complete, system-wide
transaction has been successfully committed. The
checkpoint mechanism is described in Section 3.2.


Collectively, these layers implement a two-level caching 
design. At need, the entire user-visible state of the system 
can be reduced to pages and nodes, which can then be 
written to disk. 


3 Storage Model 


The main reason for having two types and sizes of objects
is to preserve a partition between data and capabilities.
Data resides in pages and capabilities reside in nodes. It is
certainly possible to design a single level store in which all
objects are the size of a page, but it proved inconvenient
to do so in EROS for reasons of storage efficiency. In
the current implementation, capabilities occupy 16 bytes
and nodes 544 bytes, but these sizes are not exposed by
any external system interface. This leaves us free in the
future to change the size of capabilities compatibly, much
as was done in the AS/400 transition from 48-bit to 64-bit
addresses.


Every object in the store has a unique object identifier
(OID). Objects on the disk are named by
operating-system protected capabilities [Dv66], each of
which contains an object type (node, page, or process),
the OID of the object it names, and a set of permissions.
An EROS object capability is similar to a page table entry
(PTE) that contains the swap location of an
out-of-memory page.² When an object capability is
actively in use, the EROS kernel rewrites the capability
internally to point directly at the in-memory copy of the
object.
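
The capability contents and the internal rewrite can be sketched as
follows. The field names, the `prepare` function, and the dictionary
used as an object cache are hypothetical; the sketch only illustrates
the idea of replacing an OID lookup with a direct reference once the
object is in memory.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Capability:
    obj_type: str                    # "node", "page", or "process"
    oid: int                         # names the object on disk
    permissions: frozenset           # e.g. frozenset({"read", "write"})
    in_memory: Optional[Any] = None  # direct reference while in active use


def prepare(cap, object_cache, fetch):
    """Make cap point directly at the in-memory copy of its object.

    object_cache maps (type, OID) to loaded objects; fetch loads an
    object from the store when it is not yet cached.
    """
    if cap.in_memory is None:
        key = (cap.obj_type, cap.oid)
        if key not in object_cache:
            object_cache[key] = fetch(cap.obj_type, cap.oid)
        cap.in_memory = object_cache[key]
    return cap.in_memory


fetch_log = []


def fake_fetch(obj_type, oid):
    # Stand-in for disk I/O; records how often the store is consulted.
    fetch_log.append((obj_type, oid))
    return bytearray(16)


cache = {}
cap = Capability("page", 42, frozenset({"read"}))
first = prepare(cap, cache, fake_fetch)
second = prepare(cap, cache, fake_fetch)
```

After the first call the capability holds a direct reference, so the
second call needs neither a cache lookup nor a fetch.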


Pages contain user data. Nodes contain a fixed-size array
of capabilities, and are used as indirect blocks, memory
mapping tables, and as the underlying representation for
process state. Within the store, these nodes are packed
into page-sized containers called “node pots.” All I/O to
or from the store is performed in page-sized units.
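
As a rough worked example of node pot packing, the arithmetic below
uses the 544-byte node and 16-byte capability sizes given in this
section. The 4096-byte page size is an assumption made for the
example; the text does not state the page size.

```python
PAGE_SIZE = 4096        # assumption: not stated in the text
NODE_SIZE = 544         # from the text
CAPABILITY_SIZE = 16    # from the text

# Whole nodes that fit into one page-sized node pot.
nodes_per_pot = PAGE_SIZE // NODE_SIZE

# Bytes of the pot left over after packing whole nodes.
unused_bytes = PAGE_SIZE - nodes_per_pot * NODE_SIZE

# Upper bound on capability slots per node; the text does not say how
# much of a node is taken up by header information.
max_capability_slots = NODE_SIZE // CAPABILITY_SIZE
```

Under this assumption a node pot holds seven nodes, so fetching one
node brings six neighbouring nodes into memory with the same I/O.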


3.1 Home Locations 


Every object in the EROS store has a uniquely assigned 
“home location” on some disk. Some versions of EROS 
implement optional object mirroring, in which case the 
same object may appear on the disk at multiple locations 
and is updated (when needed) at all locations. Additional 
mirroring or RAID storage may be performed by the stor- 
age controller. This is invisible to the EROS kernel. 


The bulk of the disk space in an EROS system is used to
contain the “home locations” of the objects. The basic
design requirements for this part of the store are:


• Object fetch and store should be efficient. As a
result, there should be a simple, in-memory strategy
for directly translating an OID value to the disk
page frame (the home location) that contains the
object.


• EROS does not expose the physical locations of
objects outside the kernel; only OIDs are visible, and
these only to selected applications. For purposes
of locality management, there must be some
well-defined relationship between OID values and disk
locations.


² In the PTE case, no object type is needed because PTEs can only name
pages.


Without an in-memory algorithm to translate an OID into
a disk object address, the store would require disk-level
directory data structures that would in turn require
additional, sequentially dependent I/O accesses to locate
and fetch an object. The elimination of such additional
accesses is a performance-critical imperative. The
encoding of object locations, the organization of the disk,
and the locality management of home locations have been
the main focus of evolution in the design of the storage
manager.
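
One way to satisfy both requirements is an in-memory table of OID
ranges, each mapped to a contiguous run of disk page frames. The
sketch below is only an illustration of that idea, not the EROS
encoding: the class, its methods, and the one-object-per-frame
simplification are assumptions, and overlapping ranges are not
checked.

```python
import bisect


class RangeTable:
    """In-memory map from OID ranges to contiguous disk page frames."""

    def __init__(self):
        self._starts = []    # sorted first OIDs, used for binary search
        self._ranges = []    # parallel list of (first_oid, count, first_frame)

    def add_range(self, first_oid, count, first_frame):
        i = bisect.bisect_left(self._starts, first_oid)
        self._starts.insert(i, first_oid)
        self._ranges.insert(i, (first_oid, count, first_frame))

    def frame_for(self, oid):
        """Translate an OID to its home frame without any disk access."""
        i = bisect.bisect_right(self._starts, oid) - 1
        if i >= 0:
            first_oid, count, first_frame = self._ranges[i]
            if oid < first_oid + count:
                # Neighbouring OIDs land on neighbouring frames, which
                # gives an allocator a handle on locality.
                return first_frame + (oid - first_oid)
        raise KeyError(oid)


table = RangeTable()
table.add_range(1000, 100, 5000)
table.add_range(0, 10, 200)
```

Because the lookup touches only memory, locating an object costs no
extra I/O beyond the fetch of the object itself.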


3.2 Checkpointing 


To ensure that global consistency for all processes and
objects is maintained across restarts, it is sufficient for
the kernel to periodically write down an instantaneous
snapshot of the state of the corresponding pages and
nodes. To accomplish this, EROS implements an efficient,
asynchronous, system-wide checkpoint mechanism derived
from the checkpoint design of KeyKOS [Lan92].


The checkpoint system makes use of a dedicated area on
the disk. This area is also used for normal paging, and is
conceptually equivalent to the “swap area” of a
conventional paging system. Before any node or page is
made dirty, space is reserved for it in the checkpoint area.
As memory pressure induces paging, dirty objects are
paged out to (and if necessary, reread from) the checkpoint
area. Object locations in the checkpoint area are recorded
in an in-memory directory.


Periodically, or when the checkpoint area has reached a
predefined occupancy threshold, the kernel declares a
“snapshot,” in which every dirty object in memory is
marked “copy on write.” Simultaneously, a watermark is
made in the checkpoint area. Everything modified prior to
the snapshot will be written beneath this watermark;
everything modified after the snapshot is written above it.
The kernel now allows execution to proceed, and initiates
background processing to flush all of the pre-snapshot
dirty objects into the previously reserved space in the
checkpoint area. Checkpoint area I/O is append-only. If an
object is dirtied multiple times, no attempt is made to
reclaim its previously occupied space in the checkpoint
area. This ensures that checkpoint I/O is mostly
sequential.


Once all objects have been written to the checkpoint area,
an area directory is written and a log header is rewritten
to capture the fact that a consistent system-wide
transaction has been completed. A background “migrator”
now copies these objects back to their home locations in
the store. Whenever a checkpoint transaction completes,
the checkpoint area space occupied by the previous
transaction is released. To ensure that there is always
space in the checkpoint area for the next checkpoint,
migration is required to complete before a new checkpoint
transaction can be completed.
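
The snapshot, flush, commit, and migrate sequence can be sketched with
a toy in-memory model. This is a heavily simplified illustration, not
the EROS implementation: all names are hypothetical, copying the dirty
values stands in for copy-on-write marking, a counter stands in for
the rewritten log header, and space reservation and release are
omitted.

```python
class CheckpointSketch:
    def __init__(self):
        self.memory = {}     # oid -> current in-memory value
        self.dirty = set()   # oids modified since the last snapshot
        self.log = []        # append-only checkpoint area: (oid, value)
        self.committed = 0   # number of log entries covered by a commit
        self.home = {}       # home locations: oid -> migrated value

    def write(self, oid, value):
        self.memory[oid] = value
        self.dirty.add(oid)

    def snapshot(self):
        """Freeze the dirty set; later writes belong to the next checkpoint."""
        frozen = {oid: self.memory[oid] for oid in self.dirty}
        self.dirty = set()
        return frozen

    def flush_and_commit(self, frozen):
        for oid, value in sorted(frozen.items()):
            self.log.append((oid, value))    # sequential, append-only writes
        self.committed = len(self.log)       # stands in for the log header

    def migrate(self):
        """Copy committed objects back to their home locations."""
        for oid, value in self.log[:self.committed]:
            self.home[oid] = value


store = CheckpointSketch()
store.write("a", 1)
store.write("b", 10)
snap = store.snapshot()
store.write("a", 2)          # modified after the snapshot was declared
store.flush_and_commit(snap)
store.migrate()
```

The write to "a" made after the snapshot is not part of the committed
image; it stays dirty and is picked up by the next checkpoint.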


The net effect of the checkpoint system is to capture a
consistent system-wide image of the system state. If
desired, checkpoints can be run at frequencies comparable
to those of conventional buffer cache flushes, making the
potential loss of data identical to that of conventional
systems. To support the requirements of database logs,
there is a special “escape hatch” mechanism permitting
immediate transaction of individual pages.


3.3 Intuitions for Latency 


While EROS is not yet running application code, KeyKOS
has been doing so since 1980, supporting both transaction
processing and (briefly) general purpose workloads. The
performance of the checkpoint design rests on two
empirical observations from KeyKOS:


• Over 85% of disk reads are satisfied from the
checkpoint area.


• Over 50% of dirty objects die or are redirtied
before they are migrated. Such objects do not require
migration.


These two facts alter the seek profile of the system, re- 
ducing effective seek latencies for reads. They also alter 
the rotational delay profile of the system, reducing effec- 
tive rotational latencies for writes. Our goal in this section 
is to provide an intuition for why this is true. While the 
specific measurements obtained from KeyKOS probably 
will not hold for EROS twenty years later, we expect that 
the performance of checkpointing will remain robust. We 
will discuss the reasons for this expectation below. 


It is typical for the amount of data included in a given
checkpoint to be comparable to the size of main memory.
The checkpoint area as a whole must be able to hold
two checkpoints. Given a machine with 256 megabytes
of memory, the expected checkpoint area would be 512
megabytes. On a Seagate Cheetah (ST373405LC, 29550
cylinders, 68 Gbytes), this region would occupy 0.7% of
the disk, or 216 cylinders. As 100% of normal writes and
85% of all reads occur within this region, the arm position
remains within a very narrow range of the disk with high
probability.


Estimating disk latencies is deceptive, because computations
based on the published minimum, average, and maximum
seek times have nothing to do with actual behavior.
Seek time profiling is required for effective estimation.
The seek time calculations presented here are based on
profiling data collected by Jiri Schindler using DIXtrac
[SG99]. We emphasize that these are computed, rather
than measured results. We will assume a disk layout in 
which the checkpoint area is placed on the middle cylin- 
ders of the drive. 


3.3.1 Expected Read Behavior 

The disk head position in KeyKOS has a non-uniform
distribution. To compute the expected seek time, we must
consider the likely location of the preceding read as well
as the current one (Table 1), giving an expected seek time
for reads of 2.92 ms on a drive whose average seek time
is 6.94 ms. This expectation is robust in the face of both
changes to the checkpoint region size and reasonable
reductions in checkpoint locality. Increasing the checkpoint
area size to 200 cylinders raises the expected seek time to
3.05 ms. Reducing the checkpoint “hit rate” to 70% yields
an expected seek time of 3.80 ms.
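
The weighting behind these figures can be reproduced with a short sketch. It takes the per-case seek times from Table 1 and assumes that the locations of successive I/Os are independent, and that the per-case seek times stay fixed when the hit rate changes. The function and variable names are illustrative only.

```python
# Seek times (ms) for the four (current, preceding) cases of Table 1.
SEEK_MS = {
    ("ckpt", "ckpt"): 1.97,    # both I/Os inside the checkpoint area
    ("ckpt", "other"): 5.27,   # one inside, one elsewhere
    ("other", "ckpt"): 5.27,
    ("other", "other"): 6.94,  # both elsewhere: the drive's average seek
}


def expected_seek_ms(hit_rate: float) -> float:
    """Weight each case by its joint probability (independence assumed)."""
    prob = {"ckpt": hit_rate, "other": 1.0 - hit_rate}
    return sum(prob[cur] * prob[prev] * ms
               for (cur, prev), ms in SEEK_MS.items())


if __name__ == "__main__":
    for rate in (0.85, 0.70):
        print(f"hit rate {rate:.0%}: {expected_seek_ms(rate):.2f} ms")
```

Under these assumptions the sketch gives 2.92 ms at an 85% hit rate and 3.80 ms at 70%, matching the figures quoted above.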


Conventional file system read performance is largely de- 
termined by the average seek delay (in this case, 6.94 ms). 
For comparing read delays, rotational latencies can be
ignored: the expected rotational delay on both systems is one
half of a rotation per read.


The difference in expected performance is largely immune
to changes in extent size or prefetching, as both techniques
can be used equally well on both systems. Both
techniques reduce the total number of seeks performed;
neither alters the underlying seek latency or distribution.
Similarly, low utilization yields similar benefits in both
systems by reducing the effect of long seeks. The read
performance of both designs converges on the performance
of the checkpointing design as utilization falls: on
sufficiently small, packed data sets there is no meaningful
difference in seek behavior.


3.3.2 Expected Write Behavior 


As in log-based file systems, a checkpointing design
potentially performs two writes for every dirty block: one to
the checkpoint area and the second to the home locations.
Migration is skipped for data that is remodified or deleted
between the time of checkpoint and the time of migration.
Given this, there are two thresholds of interest:
Current I/O        Preceding I/O      Distance    Time      Weighted
Checkpoint (85%)   Checkpoint (85%)     108 cyl   1.97 ms   1.42 ms
Checkpoint (85%)   Other (15%)         7387 cyl   5.27 ms   0.67 ms
Other (15%)        Checkpoint (85%)    7387 cyl   5.27 ms   0.67 ms
Other (15%)        Other (15%)        14775 cyl   6.94 ms   0.16 ms

Weighted Seek Time                                          2.92 ms


Table 1: Expected read latency. Based on seek profile of the Seagate Cheetah
ST373405LC. The reported average seek time for this drive is 6.94 ms.


1. How many objects live long enough to get checkpointed.

2. Of those, how many live long enough to be migrated.


The best available data on file longevity is probably the
data collected by Baker et al. [BHK+91]. Figure 4 of their
paper indicates that 65% to 80% of all files live less than
30 seconds, that 80% of all files live less than 300 seconds,
and that 90% of all files live less than 600 seconds (one
checkpoint interval). We can estimate from this that less
than half of all files that are checkpointed survive to be
migrated. This is consistent with the measured behavior
of KeyKOS: only 50% of the checkpoint data survives to
be migrated.


The impact of this is surprising. Imagine that there are 400
kilobytes of file data to be written to a conventional file
system. The key question proves to be: what are the run
lengths? Figure 1 of the Baker measurements [BHK+91]
shows that most file run lengths are small. While the figure
does not differentiate read and write run lengths, their Table
3 suggests that write run lengths are primarily driven by
file size: 70% of all bytes written are “whole file” writes.
Their Figure 2 shows that 80% of files are 10 kilobytes or less.
Taken together, these numbers mean that the cost of bulk
flushes of the data cache is dominated by rotational delay.
The 400 kilobytes in question will be written at nearly
40 distinct locations, each of which will require 1/2 a
rotation to bring the head to the correct position within the
track. On the Cheetah, the rotational delay alone comes to
119 ms. Seek delays depend heavily on filesystem layout,
but the same considerations apply in both checkpointed
and conventional designs. If the runs are uniformly spread
across the drive, the seeks (on the Cheetah) will come to
an additional 112 ms, for a total of 231 ms.


Now consider the same 400 kilobytes under the checkpointing
design. KeyKOS and EROS perform this write using bulk
I/O and track-at-once operations. Depending on the drive,
400 to 600 kilobytes can be written to the checkpoint area
in one seek (weighted cost 2.46 ms on the Cheetah) plus
1.5 rotations (1/2 to start, 1 to complete) for a total of
11.42 ms. We must now consider the cost of migration.


The KeyKOS/EROS migration I/O behavior looks exactly
like that of a file cache that uses deferred writes. Because
the file semantics are unchanged, this I/O has similar run
lengths, and like the buffer cache flush it is done using
bulk-sorted I/O. Seek times are amortized similarly because
there are a large number of available blocks to write.
The difference is that the migrated blocks have a longer
time to die, and that the amount of data migrated is therefore
half of the data that will be written by the deferred-write
buffer cache. Because half of the data will die before
migration, only 56 ms will be spent in rotational delay
rewriting it, along with 70.47 ms of seek time (again under
the uniform distribution assumption). The combined total
cost of the checkpoint and migration writes is 137.89 ms.
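
The arithmetic of this comparison can be laid out as a small sketch. The rotational and seek totals are taken directly from the text. The single-rotation time is not stated in the text; the value used here is an assumption, inferred from the 11.42 ms checkpoint write figure.

```python
# Assumed: one rotation of the drive, inferred from (11.42 - 2.46) / 1.5.
ROTATION_MS = 5.973
# Weighted seek into the checkpoint area (from the text).
CKPT_SEEK_MS = 2.46


def conventional_flush_ms(rot_delay_ms=119.0, seek_ms=112.0):
    """Flush of ~40 short runs: rotational delay plus seeks (from the text)."""
    return rot_delay_ms + seek_ms


def checkpoint_write_ms():
    """One seek plus 1.5 rotations (1/2 to start, 1 to complete)."""
    return CKPT_SEEK_MS + 1.5 * ROTATION_MS


def migration_ms(rot_delay_ms=56.0, seek_ms=70.47):
    """Migration of the surviving half of the data (from the text)."""
    return rot_delay_ms + seek_ms


if __name__ == "__main__":
    total = checkpoint_write_ms() + migration_ms()
    print(f"conventional: {conventional_flush_ms():.2f} ms")
    print(f"checkpoint + migration: {total:.2f} ms")
```

With these inputs the checkpointing design needs roughly 60% of the time of the conventional flush, even though it writes the surviving data twice.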


4 Locality and Object Allocation 


There are two primary issues that impact the design of a
single level store. The first is common to all disk-based
storage designs: locality. It is necessary that the object
allocation mechanism provide means to arrange the disk-level
placement of objects for reasonable locality. In all
generations of the EROS store this is accomplished by
preserving a correlation between OID values and disk
positions. The second is object allocation: because all
objects must be recoverable after a crash, all allocations must
(logically) be recorded using on-disk data structures.


4.1 Content Locality 


The value of locality in general-purpose workloads is often
misunderstood. While sequential data placement for
file content is extremely important in the case of a single
request stream, it is much less important when multiple
accesses to disk occur concurrently. Disk-level traces
collected by Ruemmler and Wilkes [RW93] show that it is
very rare to see more than 8 kilobytes of sequential I/O
at the level of the disk arm. While average file sizes have
grown since that time, and (we presume) modern sequential
accesses would be longer than 8 kilobytes, the underlying
reasons for the limited dynamic sequentiality have
not changed:
• Paging I/O is limited by the size of a page.


• Many file I/Os involve sequentially dependent accesses,
as when traversing metadata.


• Directory I/Os are frequent. Directories are usually
small, and the corresponding I/Os are therefore
short.


• While read-ahead helps, excessive read-ahead is
counterproductive. Successful read-ahead works equally
well in both designs, and can be thought of as achieving
a larger extent size.


• The request streams compete for attention at the
disk arm. Even if a second, potentially sequential
I/O request is initiated quickly by the application or
the operating system, the interrupt-level logic has
already initiated an arm motion if multiple requests
are present, preventing immediate service of the
sequential request.


Two facts suggest that file system sequentiality may be
important only up to a limited extent size. Log-structured
designs organize data by temporal locality rather than
spatial locality. In spite of this, read performance for
general-purpose workloads is not degraded significantly in
log-structured designs [SSB+95]. It has also been established
that file system aging ultimately has a more significant
impact on overall I/O performance than the logging/clustering
choice [SS95].


The EROS store design effectively optimizes for both cases. 
Newly modified objects are stored in the checkpoint area 
according to temporal locality, but the small size of the 
checkpoint area ensures physical locality as well. Data
in home locations is placed according to locality deter- 
mined at object allocation time, which is when maximal 
semantic knowledge (and therefore maximal knowledge 
of likely reference patterns) is available. 


While single level stores do not always implement directories
and indirect blocks in the style of file systems, the
corresponding concepts are implemented elsewhere in the
operating system, and the basic usage patterns involved
are ultimately driven by application behavior. Similar extent
size arguments should therefore apply in single level
stores. This introduces a significant degree of freedom
into file system or single level store design, one that we
plan to leverage in the next generation of our store.


4.2 Metadata Locality 


A more pressing issue in the EROS single-level store is
metadata locality. In the Berkeley Fast File System design
[MJLF84], for example, block location occurs through a
two-stage hybrid translation scheme. The first stage translates
the inode number to the inode data structure. This
translation is performed at file open time, and the result
is cached in an in-memory inode. The second stage traverses
the file indirect blocks to locate individual blocks
in the file. These indirect blocks are cached in memory
according to the same rules as other data blocks, but due to
higher frequency of access are likely to remain in memory
for active files.


An EROS address space is a persistent mapping from
offsets to bytes. Address spaces need not be associated with
processes, and EROS therefore uses them to hold file data 
as well as application memory images. As a result, the 
EROS metadata that is most closely analogous to conven- 
tional file metadata is EROS address space metadata. 


An EROS address space is organized as a tree of nodes 
whose leaves are pages (Figure 2), much as a UNIX file 
is organized as a tree of indirect blocks whose leaves are 
data blocks. Because EROS nodes are much “narrower” 
than typical indirect blocks, the height of the address space 
tree for any given address space is taller than the height of 
the indirect block tree for a UNIX file of corresponding
size ($\log_{32}(size) > \log_{256}(size)$). Any tree traversal of
this type implies sequential disk accesses with associated 
seek delays. The Ruemmler data shows that these seeks 
are frequently interspersed with other accesses when
multiple request streams are present.
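
The height difference can be made concrete with a small sketch. The fanout of 32 for EROS nodes comes from the text. The indirect block fanout of 1024 is purely a hypothetical value chosen for illustration (for example, a 4 KB block of 4-byte pointers), and a uniform tree is assumed for both structures, which simplifies the real UNIX layout.

```python
def tree_height(num_pages: int, fanout: int) -> int:
    """Levels of interior nodes needed to address num_pages leaf pages."""
    height, reach = 0, 1
    while reach < num_pages:
        reach *= fanout
        height += 1
    return height


EROS_FANOUT = 32          # from the text: log base 32
INDIRECT_FANOUT = 1024    # hypothetical: 4 KB block of 4-byte pointers

if __name__ == "__main__":
    for pages in (1_000, 1_000_000):
        print(pages,
              tree_height(pages, EROS_FANOUT),
              tree_height(pages, INDIRECT_FANOUT))
```

Under these assumptions a space of one million pages needs four levels of EROS nodes but only two levels of indirect blocks, so each uncached traversal may involve twice as many dependent disk accesses.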


Request interleave in turn interacts badly with typical,
non-backtracking disk arm scheduling policies, and can lead to
as much as a full disk seek before the next block down the
tree will be fetched.³ Because of their greater tree height,
EROS address spaces potentially involve more levels of
traversal, and it is correspondingly more important to manage
the locality and prefetching of nodes within an address
space. The discussion of the EROS space bank below
(Section 5) describes how this locality is achieved.


4.3 Allocation Performance 


The final performance issue in single level stores is the
efficiency of object allocation, particularly with respect
to ephemeral allocations such as heap pages or short-lived
file content. In a conventional file system, these ephemeral
blocks are allocated from swap space and do not survive
system shutdown or failure. Because there is no expectation
that these allocations are preserved across restarts,
in-memory data structures and algorithms can be used to
implement them. Linux, for example, keeps an in-memory
allocation bitmap for each swap area [BC00].


Figure 2: An EROS address space.


³ Many newer drives implement backtracking seeks, but doing so raises
both convergence and variance issues that must be avoided by the
operating system in real-time applications.


There is no way to persistently store ephemeral allocation
data without incurring some disk I/O overhead. The challenge
in a single-level store is to keep this overhead to a
minimum. EROS accomplishes this in two ways:


1. No recording of allocation is performed for objects
that are allocated and deallocated within the same
checkpoint interval. This largely recaptures the
efficiency advantages of conventional swap area
allocations.


2. In the current and previous versions of the EROS
store, the store is divided into regions, each of which
has an overhead page containing bits that indicate
whether an object in the region is empty (zero).
Sequential allocations first pull in the overhead page,
but then avoid I/Os for successive objects within
the same region. Deallocations update the bit rather
than the on-disk object, with similar I/O reductions.
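
The first of these two points can be illustrated with a minimal sketch. It is not the EROS implementation; the class and method names are hypothetical, and the sketch models only the bookkeeping idea that an object allocated and freed within one checkpoint interval leaves nothing to record.

```python
class IntervalAllocationLog:
    """Tracks which allocation changes must reach disk at the next checkpoint."""

    def __init__(self):
        self.pending_allocs = set()    # allocated this interval, not yet recorded
        self.pending_deallocs = set()  # recorded in an earlier interval, now freed

    def allocate(self, oid):
        self.pending_allocs.add(oid)

    def deallocate(self, oid):
        if oid in self.pending_allocs:
            # Born and died in the same interval: nothing to record.
            self.pending_allocs.discard(oid)
        else:
            self.pending_deallocs.add(oid)

    def checkpoint(self):
        """Return (allocs, deallocs) to record, then start a new interval."""
        out = (self.pending_allocs, self.pending_deallocs)
        self.pending_allocs, self.pending_deallocs = set(), set()
        return out
```

In this sketch a short-lived heap page costs two set operations and no disk record, which is the behavior the text compares to conventional swap area allocation.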


5 The Space Bank 


The EROS storage manager, known as the “space bank,” 
is a user-mode application that performs all storage
allocation in the EROS system. There is a hierarchy of logical
space banks, all of which are implemented by a single
server process. Each logical bank: 


• Allocates and deallocates individual pages and nodes
on request.


• Remembers what objects have been allocated from
that logical bank, so that they can be bulk-reclaimed
when the bank is destroyed.


• Provides locality of allocation so long as this is
feasible on the underlying disk, up to the limit of the
system-designed extent size.


• Imposes optional limits (quotas) on the total number
of pages and nodes that can be allocated from that
logical bank.


• Provides means to create “child” banks whose storage
comes from the parent, creating a hierarchy of
storage allocation.


5.1 Extent Caching 


A naive implementation of the space bank would allocate 
one object at a time, recording each allocation in some 
suitable ordered collection. Typically, each dynamically 
allocated object in the system has associated with it at 
least one logical bank. For example, most address spaces 
are implemented as “copy on write” versions of some ex- 
isting space, and the copied pages (and the nodes that ref- 
erence them) are allocated from a per-space bank. One 
impact of this is that space bank invocations are frequent.
As a result the single object approach would not provide 
good locality. 


An obvious solution would be to allocate storage to each 
bank in extents, and allow the bank to suballocate objects 
from this extent. Unfortunately, this doesn’t work well 
either. If we imagine that each extent contains 64 pages, 
and that there is some variation in address space sizes,
we must conclude that when each address space, process, 
or other synthesized object has been completely allocated 
there would remain within its bank a partially allocated 
extent. In the absence of empirical data, we should ex- 
pect that this residual extent would on average be half al- 
located. Unfortunately, there is no simple way to know 
which banks are done allocating. This means that there 
would be a very large number of outstanding banks (one 
per process, one per file, etc.), each of which has committed
to it 32 page frames of disk storage that will never be
allocated. 


The solution to this is extent caching. Instead of associating
an individual extent with each bank, the space bank
maintains a cache (~128 lines) of “active” extents. Each of
these extents begins at an OID corresponding to a 64 page 
boundary on the disk and contains 64 page frames worth 
of OIDs (some of which may already be allocated). Ex- 
tent caches are typed: nodes and pages are allocated from 
distinct extents, which helps to preserve metadata locality. 
The extent cache design relies on the fact that sequential 
OIDs correspond with high likelihood to sequential disk 
locations. 


Every bank is associated with a line in the extent cache by 
a hash on the address of the logical bank data structure. 
When a bank needs to allocate an object, it first checks 
availability in its designated cache line. If that extent has 
no available space, then a “cache miss” occurs, and an
attempt is made to allocate a fresh extent from the underlying
disk space. If this proves impossible, as when
disk space is near exhaustion, the needed object will be
allocated from the first extent in the extent cache that has 
available space. 
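
The allocation path just described can be sketched as follows. This is an illustrative model rather than the space bank's code: the class names are hypothetical, banks are identified by plain integers in place of the address of the bank data structure, a simple iterator stands in for the disk free map, and the separation of node and page extents is omitted.

```python
EXTENT_PAGES = 64   # frames per extent (from the text)
CACHE_LINES = 128   # approximate number of cache lines (from the text)


class Extent:
    def __init__(self, base_oid):
        self.base_oid = base_oid
        self.next_free = 0                  # simplistic bump allocator

    def has_space(self):
        return self.next_free < EXTENT_PAGES

    def take(self):
        oid = self.base_oid + self.next_free
        self.next_free += 1
        return oid


class ExtentCache:
    def __init__(self, fresh_extent_bases):
        self.lines = [None] * CACHE_LINES
        self.fresh = iter(fresh_extent_bases)   # stand-in for the disk free map

    def allocate(self, bank_id):
        line = hash(bank_id) % CACHE_LINES      # bank -> designated cache line
        ext = self.lines[line]
        if ext is not None and ext.has_space():
            return ext.take()
        base = next(self.fresh, None)           # "cache miss": try a fresh extent
        if base is not None:
            self.lines[line] = Extent(base)
            return self.lines[line].take()
        for other in self.lines:                # fallback: first extent with space
            if other is not None and other.has_space():
                return other.take()
        raise MemoryError("no free objects")
```

A bank that allocates repeatedly receives consecutive OIDs from its line until the extent fills, which is the probabilistic sequentiality the design relies on.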


The effect of the extent cache is to ensure that banks re- 
ceive sequential objects in probabilistic fashion up to the 
limits of the extent size. While it is possible for two banks
that are simultaneously active to hash to the same extent,
we have not observed it to be a problem. A secondary
effect of the extent cache is that the disk page frame
allocation map is consulted with reduced frequency. This is
desirable because consulting the allocation map involves a
linear search through a page and consequently flushes the
CPU data cache, which has a significant impact on allocation
speed. Our first space bank implementation did this,
and we found that the cost of data cache reconstruction 
after allocation overwhelmed all other costs. 


When objects are deallocated, they are restored to the extent
cache only if the containing extent is still in the cache.
Otherwise, they are returned directly to the free map. Newly
freed objects are reused aggressively, because reusing objects
that are still in memory eliminates extra disk I/Os
that would record their deallocation. Reuse of old objects
is deferred. The assumption behind this is that the objects
allocated to a given bank share a common temporal ex- 
tent and will tend to be deallocated as a group. Given this, 
it is better to wait as long as possible before reusing the 
available space in an older extent in order to maximize the 
likelihood that the entire extent has become free. 


To support address space metadata locality, the space bank 
implements a two-level allocation scheme for nodes. The 
extent cache caches page frames. The space bank sub- 
allocates nodes sequentially from these frames. Because 
address spaces are constructed by copy on write methods, 
and because the copy on write process proceeds top down 
in the node tree (Figure 2), it is usual for the entire path of 
nodes from the root to the first page to be allocated from 
a single page frame on the disk. When the top node is 
fetched, its containing page frame is cached in the page 
cache. The effect of this is that the entire sequence of 
nodes from the address space root node to the referenced 
page is brought into memory with a single disk I/O.


5.2 Record-Keeping Locality 


Conceptually, each space bank’s record of allocated ob- 
jects can be kept by any convenient balanced tree struc- 
ture. The current EROS implementation uses a red-black 
tree. There are two potential complications that need to 
be considered in building this tree. 


The first is sheer size. Collectively, the number of RB-tree
nodes is on the same order as the number of disk objects.
While these structures might fit within the space bank’s
virtual address space on current machines, they certainly
will not fit within physical memory in large system
configurations. However clever the data structure, care must
be taken to ensure that traversals of these structures do
not suffer from poor locality due to space bank heap
fragmentation. This type of poor locality translates directly
into paging. The current space bank implementation does
not attempt to manage this issue, which is a potentially
serious flaw. A simple solution would be to allocate tree
nodes using an extent caching mechanism similar to the
one already used for nodes and pages, or a slab-like
allocation mechanism [Bon94, BA01].


The second is space overhead. Each OID occupies 64
bits, and it would be disproportionate to spend an ad- 
ditional two or three pointers per object to record allo- 
cations. As a result, the current bank tree nodes record 
extents rather than OIDs, and use a per-extent allocation 
bitmap to record which objects within an extent have been 
allocated. Even if two or three banks are simultaneously 
performing allocations from the same extent cache en- 
try, the net space overhead of this is lower than per-OID 
recording. The current implementation relies on every- 
thing fitting within virtual memory. On the Intel x86 fam- 
ily implementation, this will be adequate until the attached 
disk space exceeds 2.6 terabytes. 


6 Two Early Disk-Level Designs 


The original EROS system, including its store design, fol- 
lowed the published design of KeyKOS [Har85]. Each 
disk is divided into ranges of sequentially numbered objects.
Ranges are partitioned by object type; a given range
contains nodes or pages, but not both. Every object’s
OID consists of a 64 bit “coded disk address” concatenated
with a 32 bit “allocation count.” The coded disk
address describes the location of the object, and the allo- 
cation count indicates how many times a particular object 
has been allocated. In order for a capability to be valid, 
the allocation count in the capability must match the al- 
location count recorded on the disk for the corresponding 
object. 
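
The validity rule can be stated as a short sketch. The `Capability` type and the dictionary of on-disk counts are hypothetical stand-ins for the real structures; only the comparison itself is taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    coded_disk_address: int   # names the object's location
    alloc_count: int          # allocation count captured when the capability was made


def is_valid(cap: Capability, on_disk_counts: dict) -> bool:
    """A capability is valid only if its count matches the object's current count."""
    return on_disk_counts.get(cap.coded_disk_address) == cap.alloc_count
```

Because reallocating an object changes its count, any capability that still carries the old count fails this check and is treated as invalid.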


6.1 Original Storage Layout 


At startup, the kernel probes all disks, identifies the ranges 
present, and builds an in-memory table with an entry for 
each range: 


(node/page, startOID, endOID, disk, startSec) 


It also scans the checkpoint area to rebuild the in-memory 
directory of object locations. 
For nodes, the allocation count is recorded in the node
itself. Nodes are numbered sequentially within a range, and
are packed in page-sized units called node pots: no node
on the disk is split across two pages. Once the containing
range has been identified, the disk location of the relative
disk page frame containing a node can be computed by

(OID - startOID) / NodesPerPageFrame


For pages, the allocation count cannot be recorded within 
the page, because all of the available bytes are already 
in use. Instead, page ranges are further subdivided into 
subranges. Each subrange begins with an allocation pot, 
which is a page that contains the allocation counts for the
following pages in the subrange (Figure 3). The allocation
pot also contains a byte containing various flags for each
page, including one indicating whether that page is known
to hold zeros.


[Figure 3 diagram: a page range drawn as a sequence of subranges, each an
allocation pot followed by its page frames; the final subrange is truncated.]


Figure 3: Page range layout.


Each allocation pot can hold information for up to 819 
pages, so a page range is organized as a sequence of
subranges, each 820 pages long and consisting of an
allocation pot followed by its associated pages. Depending
on the size of the underlying partition, the final subrange 
may be truncated. Once the containing range has been 
identified, the disk location of a relative disk page frame 
containing a page can be computed by 


(OID - startOID) + (OID - startOID) / 819 + 1


In either case, the relative frame can then be combined
with the startSec value to yield the starting sector for the
I/O.
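
The two frame computations of this section can be written out as a sketch, assuming integer division throughout. The figure of 7 nodes per frame is taken from Section 8. The value of 8 sectors per frame is an assumption (4096-byte frames on 512-byte sectors), as are the function names.

```python
PAGES_PER_POT = 819      # pages described by one allocation pot (from the text)
NODES_PER_FRAME = 7      # nodes per node pot (from Section 8)
SECTORS_PER_FRAME = 8    # assumed: 4096-byte frames, 512-byte sectors


def node_frame(oid, start_oid):
    """Relative frame holding a node in a node range."""
    return (oid - start_oid) // NODES_PER_FRAME


def page_frame(oid, start_oid):
    """Relative frame holding a page, skipping one pot per 819 pages."""
    rel = oid - start_oid
    return rel + rel // PAGES_PER_POT + 1


def start_sector(frame, start_sec):
    """Combine a relative frame with the range's startSec."""
    return start_sec + frame * SECTORS_PER_FRAME
```

For example, the first page of a range lands in frame 1 (frame 0 is the first allocation pot), page 818 in frame 819, and page 819 in frame 821, because frame 820 holds the second allocation pot.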


6.2 Unified Object Spaces 


The design of Section 6.1 suffers from an irritating flaw:
it partitions the disk into typed ranges. There is no easy 
way to know in advance the correct proportion of nodes to 
pages, and the design does not provide any simple means 
to reorganize the disk (there are no forwarding pointers) 
or to interconvert ranges from one type to another. We 
found that we were continuously adjusting our directions 
to the disk formatting program to add or remove objects 
of some type. 


The solution was to adopt the page range layout for all
ranges, and use an available bit in the allocation pot to
indicate the “type” (node or page) of the corresponding
disk page frame (Figure 4). If the frame type is “page,”
the allocation count in the allocation pot is the allocation
count of the page; otherwise it is the maximum of the allocation
counts of all nodes contained in the frame. The OID encoding
was also reorganized, using the least significant 8 bits as the
index of the object within the frame and the upper 56 bits
as the “frame OID.” The frame offset computation proceeds
as previously described for page frames, with the
caveat that the OID value must be shifted before performing
the computation.
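
A minimal sketch of the revised encoding follows. The bit widths are from the text; the function names are hypothetical, and the sketch assumes that the range's starting OID is shifted in the same way as the object's OID.

```python
INDEX_BITS = 8           # low 8 bits: index of the object within its frame
PAGES_PER_POT = 819      # frames described by one allocation pot


def split_oid(oid):
    """Return (frame OID, index within frame) for a 64-bit OID."""
    return oid >> INDEX_BITS, oid & ((1 << INDEX_BITS) - 1)


def unified_frame(oid, start_oid):
    """Relative frame of an object: shift first, then apply the page formula."""
    rel = (oid >> INDEX_BITS) - (start_oid >> INDEX_BITS)
    return rel + rel // PAGES_PER_POT + 1
```

All objects that share a frame OID map to the same relative frame, so a node and its neighbors in the same frame are located by a single computation.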


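[Figure 4 diagram: a unified range in which each allocation pot is followed
by its frames, some of which are node frames; the final subrange is truncated.]


Figure 4: Unified range layout.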


In the revised design, the kernel converts a frame from 
one type to another whenever an object of the “wrong” 
type is allocated by the space bank. The kernel assumes 
that the space bank has kept track of available storage, 
and that it will not unintentionally reallocate storage that 
is already in use. There is a minor complication, which is
that the kernel must ensure that the allocation count is never
decreased by conversion. This is assured by setting the
allocation count to max(node alloc counts) + 1 when
converting a frame from nodes to pages and setting all 
node allocation counts to the page allocation count when 
converting a frame from pages to nodes. 
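
The conversion rule is small enough to state directly. The sketch below uses hypothetical function names and assumes 7 nodes per frame, as stated in Section 8.

```python
def nodes_to_page(node_alloc_counts):
    """Allocation count for a frame retyped from nodes to a page."""
    return max(node_alloc_counts) + 1


def page_to_nodes(page_alloc_count, nodes_per_frame=7):
    """Allocation counts for the nodes of a frame retyped from a page."""
    return [page_alloc_count] * nodes_per_frame
```

In either direction no count ever moves backwards, so a stale capability that names an object in the retyped frame cannot become valid again through conversion.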


The switch to unified ranges simplifies the kernel object 
management code, but more importantly it simplifies the 
use and allocation of disk storage. Disk frames can now 
be traded back and forth between types as needed. In 
addition to allowing address space metadata and data to 
be placed in a localized fashion (thereby facilitating read- 
ahead across object types), the new design can potentially 
be extended to perform defragmentation in order to improve
extent effectiveness. Individual banks effectively
record the relationships between pages, nodes, and their
containing objects, and the ability to retype frames supports
storage compaction. To perform compaction, two
additional bits can be taken from the “flags” field: one to “lock”
an object while its content is copied to a new destination frame
and the old frame is used to record the new location, and a
second to mark the object forwarded. Either the
space bank or a helper application can now iterate through
all nodes, rewriting their capabilities to reflect the new
object location.
7 Embedded EROS


In early 2000, as part of an exploratory research collabo- 
ration with Panasonic, we started to investigate the possi- 
bility of an embedded version of EROS for selected real- 
time applications. As part of this, a decision was made to 
remove all remaining drivers from the kernel. Together, 
these decisions introduced three new requirements into 
the overall system design: 


1. In order to support DMA, we needed a way to support
pages (but not nodes) whose physical memory
address could be known to a driver. 


2. The kernel now needed some mechanism for alloca- 
tion of non-pageable and non-checkpointed objects. 


3. Ordinary disk ranges now needed to be served by 
user-mode drivers. 


To address these requirements, the “range” notion was
generalized to the notion of object sources. An object 
source implements some sequential range of OIDs. This 
range may be only partially populated. 


By well-known convention, two ranges of OIDs are re- 
served. One corresponds to physical memory pages, while 
the other allocates non-pageable objects. EROS already 
uses main memory as an object cache. The physical page 
object source will allocate any OID whose range-relative 
frame index corresponds to a physical memory page frame 
that is part of the page cache. The effect of allocating a
capability for such an OID is to evict the current resident 
of that page cache entry and relabel the entry as a physi- 
cal page object. The non-pageable object range is similar, 
though there is no guarantee that the object will occupy 
any particular physical address. Both physical memory 
pages and non-pageable objects are exempted from check- 
pointing and eviction. When these objects are freed, the 
corresponding cache locations are returned to the object 
cache free pool for later reuse. 


In the embedded design, the system is partitioned into a
non-persistent space that contains drivers and the object
store manager, and a persistent space that operates exactly
as before. The driver portion of the system is loaded
from ROM, and uses an object source registry capability
to register support for a persistent range if one is to
be implemented. The persistent OID range (if present) is
“backed” by a user-mode object source driver, and the kernel
defines a protocol by which this driver can insert or remove
objects whose OIDs fall within the range it controls.
When completed, there will also be a protocol by which
the persistent source driver indicates how many dirty objects
can be permitted for its range at any given time. The
implementation of the checkpoint mechanism in this design
is relocated to the persistent source driver; the kernel
remains responsible for snapshot and for “writing back”
the checkpointed objects to the persistent source driver.


8 The Vertical View 


As with conventional file systems, the effectiveness of a 
single-level store design relies on the interaction of tem- 
poral, spatial, and referential efficiencies implemented co- 
operatively by several vertical layers in the system. This 
section briefly recaps the critical points in a single place 
so that their combined effect can be more readily seen. 
Each item is annotated by the section that discusses it.


Temporal Efficiency: 


• The kernel ensures that objects that are allocated
and deallocated within a single checkpoint interval
generate no I/O to home locations provided that
capabilities to them are never written to the disk. [4.3]


• The space bank eagerly reuses young, dead objects
to reduce unnecessary recording of deallocations,
and to aggressively reuse allocation pots that it knows
must already be in memory. [5.1]


Spatial Efficiency: 


• The space bank allocates objects using bank-wise
extents, which helps to preserve disk-level locality.
Separate extents are used for pages and nodes. [5.1]


• There is a direct correspondence between OIDs and
page frame placement in the store, eliminating the
need for directory or indirection blocks in the object
store. [6.1, 6.2]


Referential Efficiency: 


• The address space copy on write implementation
combines a dedicated bank (and therefore a dedicated
extent) with top-down metadata traversal, ensuring
that all “indirect blocks” in a given traversal
will tend to be fetched in a single I/O operation.
[4.2]


• Both the checkpoint and the migration systems use
bulk, sorted I/O, reducing total seek latency in spite
of performing a larger number of object writes. [3.3]


• Empty (zero) objects are neither written to nor read
from the home locations; only their allocation pots are
revised. [6.1]


• Both the checkpoint directory [3.2] and the range
table [6.1] are kept in memory. No additional disk
I/Os are required to determine the location of a target
page or node.
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The combined effect of this may be illustrated by describ- 
ing in more detail what happens when an object is to be 
loaded. 


When a page or node is to be fetched, the EROS kernel
first consults an in-memory object hash table to determine
if the object is already in memory. This includes checking
for the containing node pot or allocation pot as appropriate.
Next, the checkpoint area directory is consulted to see
if the current version of this object is located in the
checkpoint area. If the object is not found in the checkpoint
area, the range table is consulted and an I/O is initiated
for the object’s containing page frame and (if needed) its
allocation pot. In the typical case, ignoring read-ahead,
only one I/O is performed. The effectiveness of the node
allocation strategy tends to yield one I/O for every 7 nodes
(because 7 nodes fit in a node pot). Similarly, the allocation
pot I/O is performed only once for a given 819-frame
region in the home locations; the overhead of these I/Os
is negligible.
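

The lookup order described above can be sketched in Java. This is an
illustrative model only, not EROS kernel code: the class name, the
set-based tables and the Source enum are assumptions made for the
example. Only the order of consultation (object hash table, then
checkpoint directory, then range table) and the single read in the
typical case are taken from the text.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative model of the load path; every name here is hypothetical. */
final class ObjectLoadSketch {
    /** Where the current version of an object was found. */
    enum Source { MEMORY, CHECKPOINT_AREA, HOME_LOCATION }

    // Objects already resident in memory (stands in for the object hash table).
    private final Set<Long> objectHashTable = new HashSet<>();
    // Objects whose current version lives in the checkpoint area.
    private final Set<Long> checkpointDirectory = new HashSet<>();
    private int diskReads = 0;

    void recordCheckpointed(long oid) { checkpointDirectory.add(oid); }

    int diskReads() { return diskReads; }

    Source load(long oid) {
        // 1. In-memory object hash table: no I/O if the object is resident.
        if (objectHashTable.contains(oid)) {
            return Source.MEMORY;
        }
        // 2. In-memory checkpoint directory: is the current version there?
        // 3. Otherwise the in-memory range table locates the home frame.
        Source source = checkpointDirectory.contains(oid)
                ? Source.CHECKPOINT_AREA
                : Source.HOME_LOCATION;
        diskReads++;               // typical case: one read, ignoring read-ahead
        objectHashTable.add(oid);  // the object is now resident
        return source;
    }
}
```

Because both directories are consulted in memory, the only disk access
in this model is the read that fetches the object itself.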


The total effectiveness of the single-level store relies on 
collaboration between the storage allocator, its clients, the 
kernel, and the underlying store design. Each of these 
pieces, taken individually, is relatively straightforward. 


Conceptually, this layering is not so different from what 
happens in a file system. An important difference is that in
the filesystem design this layering is opaque. In EROS, it 
is straightforward to implement customized memory man- 
agers or use multiple space banks for more explicit extent 
management. 


9 Related Work 


While the idea of a single-level store is widely known
among operating system implementors, relatively little has
been published about the design of such stores. As mentioned
in the introduction, the AS/400 is perhaps the best known
implementation, but the only widely available reference on this
design [Sol96] provides inadequate details. Even within
IBM, information on the AS/400 implementation is closely
held.


Both Grasshopper [DdBF+94] and Mungi [HEV+98] use
single-level stores. Neither has published details on the
storage system itself. As in the AS/400, persistence in the
Mungi system is explicitly managed. Grasshopper’s is
transparent, but its strategy for computing transitive
dependencies is both complex and expensive.


Consistent checkpointing has been the subject of several 
previous papers, most notably work by Elnozahy et al. 
[EJZ92] and Chandy and Lamport [CL85]. 


KeyKOS, from which the EROS design is derived, uses
a single level store and consistent checkpoint mechanism
described in [Lan92]. The design of the store itself has
never previously been published.


The L3 system [Lie93] implemented a transparent check- 
pointing mechanism in its user-level address space man- 
ager. Like the KeyKOS and EROS checkpointing designs, 
this implementation uses asynchronous copy on write for 
interactive responsiveness. Fluke similarly implemented 
an experimental system-wide checkpoint mechanism at 
user level [TLFH96], but this implementation is a “stop 
and copy” implementation. Disk writes are performed
before execution can proceed, making the Fluke implementation
unsuitable for interactive or real-time applications.
Neither the L3 nor the Fluke checkpointers perform
any sort of systemwide consistency check prior to
writing a checkpoint, introducing the likelihood that sys- 
tem state errors resulting from either imperfect implemen- 
tation or ambient background radiation will be rendered 
permanent. 


The checkpoint design presented here is similar in many 
respects to the behavior of log-structured file systems such 
as LFS [SBMS93, MR92]. As with log-structured file 
systems, the checkpoint mechanism converts random
writes into sequential writes. Unlike the log-structured 
design, the EROS checkpointing design quickly converts 
this temporally localized data into physically localized data 
by migrating it into locations that were allocated based on 
desired long-term locality. The resulting performance re- 
mains faster than conventional file systems, but does not 
decay as file system utilization increases. 


10 Future Work 


The store designs described in sections 6 and 7 reflect 
a mature placement strategy that has been tested over a 
long period of time. While effective, this placement strat- 
egy suffers from a significant limitation: it is difficult to 
administer changes to the underlying disk configuration. 
While the EROS system implements software duplexing 
to allow storage to be rearranged, the rearrangement process
is neither efficient nor “hands free.” It would be better
to have an automated means to take advantage of new and 
larger stores. 


The current EROS space bank implementation can theoretically
handle stores slightly larger than 2^29 pages (2.6
terabytes). This corresponds approximately to one fully-
populated RAID subsystem containing 10 drives, each pro- 
viding 60 gigabytes of storage. While not common on 
the desktop, these configurations appear more and more 
frequently on servers. The paging behavior of the cur- 
rent space bank implementation would be quite bad for 
this size store. While the space bank could be reimplemented
to eliminate thrashing during allocation, a better
approach overall would be to loosen the association between
OIDs and disk page frames. Extent-level locality
is essential, but a sparsely allocatable OID space would
eliminate the need for the red-black trees that are currently
used to record object allocations.


From an addressing standpoint, larger stores do not present 
an immediate problem for EROS. The underlying OID 
space can handle stores of up to 2°© pages.* Given that 
there is no basic addressability problem, the key question 
is “how will we manage the growth?” The sequential allocation
strategy used by the space bank does not do an
effective job of balancing load over disk arms in the absence
of a RAID controller. Further, the current mapping
strategy from OIDs to disk page frames does not lend itself
to physical rearrangement of objects as the available
storage grows. All of these issues point to a need for a
logical volume mechanism.


The next and (we hope) final version of the EROS single 
level store will likely be based on randomization-based 
extent placement. This is a nearly complete departure 
from the designs described here: 


• Node and page OID spaces are once again partitioned.


• Allocation counts are abandoned. OIDs can be reallocated
only if the space bank knows that no capability
using the OID exists on the disk.


• The direct map from OIDs to disk ranges has been
abandoned entirely. Objects are now placed using a
randomization-based strategy.


• While extent-based object placement continues to
be honored on a “best effort” basis using low-order
bits of the OID, there is no longer any direct association
between an OID and the location of an object
on disk.


• Ranges can be dynamically grown or shrunk as disks
are added and removed.


• Where the previous operations on ranges were “allocate”
and “deallocate,” the new design separates
object allocation into two parts: storage reservation
and object name binding.


The inspiration for this departure is a new, randomization-
based disk placement strategy being explored within the
Systems Research Laboratory based on prior work by Brinkmann
and Scheideler [BSS00]. We plan to adopt a single-
disk variant of this strategy in the EROS single-level store.


The key motivation for this change is the ability to divorce
OIDs from physical placement without losing the
ability to directly compute object addresses. Under the
new strategy, disks or partitions can be added or removed
from the system at will without needing to garbage collect
or renumber OIDs, and data can be transparently shifted
to balance load across available disk arms. This renders
the system more easily scalable, and provides the type of
load balancing and latency properties needed for multimedia
applications.


An interesting challenge in the randomization-based design
is to preserve an effective balance between spatial
locality and adaptive scalability.


4 The disk drive industry has yet to produce 25° pages of total disk
storage over the lifespan of the industry, but will soon cross this mark.


12 Conclusions


This paper describes three working implementations of
a single-level store. To our knowledge, this is the first
time that any single-level store design has been comprehensively
described in the public literature. The code is
available online, and can be found at the EROS web site
[Sha]. In describing our design, we have attempted to
identify both the critical performance issues that arise in
single level store designs and the solutions we have found
to those issues.


One key to an effective single-level store is the interaction
between temporal, spatial, and referential efficiency.
This is made possible in EROS by the fact that disk-level
locality information is rendered directly available for
application use. Where file systems make locality decisions
at the time the file is closed and the file cache is flushed,
the EROS space bank makes these decisions when the
corresponding storage is allocated, which is the point where
maximal semantic knowledge of intended usage is at hand.


A surprising attribute of the EROS single level store is that 
in spite of its vertical integration it has undergone several 
major changes with minimal application-level impact. We
have changed the capability size, the OID encoding, and the
checkpoint design, and have removed the object driver from
the kernel. The only application code that changed was
the space bank. Within the kernel itself, even the cache 
management code has gone largely unchanged as these 
modifications occurred. 


From a design perspective, this paper has illustrated that 
single level stores simplify operating system design by re- 
moving an unnecessary layer of abstraction from the sys- 
tem. Instead of implementing a new and different seman- 
tics at the file system layer, a single level store extends the 
memory mapping model downwards to include the disk, 
allowing applications to control placement directly. In 
EROS these placement controls are generally provided by 
standard fault handling programs; most applications simply
use these handlers, and require no code at all for storage
management, so separation of concerns is effectively
maintained. On the other hand, applications with unusual
requirements can replace these fault handlers if needed. 
The total EROS system size is roughly 25% that of Linux. 


Microbenchmarks [SSF99] show that performance-critical 
object allocations in EROS are fast. Hand examination 
shows that the mechanisms described here are actually 
generating good disk-level locality. EROS-specific benchmarks
show that EROS makes effective use of the available
sustained disk bandwidth. In practice the main problem
with checkpointing seems to be finding a heuristic that
does the associated I/Os slowly enough to avoid interfering
with interactive processing. We also know that
the KeyKOS database system, whose disk performance is
critical, delivered exceptionally strong performance. All
this being said, a key missing piece in this paper is
application-level benchmarks. We are in the process of porting
several server and client applications to EROS, and plan
to measure application-level performance when this has
been done.


This paper is dedicated to the memory of Prof. Dr. Jochen 
Liedtke (1953-2001). 
Abstract


Building a flexible kernel from components is a promising
solution for supporting various embedded systems.
The use of components encourages code re-use and reduces
development time. Flexibility permits the system
to be configured at various stages of the design, up to run
time. In this paper, we propose a software framework,
called THINK, for implementing operating system kernels
from components of arbitrary sizes. A unique feature
of THINK is that it provides a uniform and highly
flexible binding model to help OS architects assemble
operating system components in varied ways. An OS
architect can build an OS kernel from components using
THINK without being forced into a predefined kernel
design (e.g. exo-kernel, micro-kernel or classical OS
kernel). To evaluate the THINK framework, we have implemented
KORTEX, a library of commonly used kernel
components. We have used KORTEX to implement several
kernels, including an L4-like micro-kernel, and kernels
for an active network router, for the Kaffe Java virtual
machine, and for a Doom game. Performance measurements
show no degradation due to componentization
and the systematic use of the binding framework, and
that application-specific kernels can achieve speed-ups
over standard general-purpose operating systems such as
Linux.


1 Introduction 


Embedded systems, such as low-end appliances and network
routers, represent a rapidly growing domain of systems.
This domain exhibits specific characteristics that
impact OS design. First, such systems run one or only
a small family of applications with specific needs. Second,
for economic reasons, memory and CPU resources
are scarce. Third, new hardware and software appear at
a rapid rate to satisfy emerging needs. Finally, systems
have to be flexible to support unanticipated needs such
as monitoring and tuning.


Implementing an operating system kernel for such sys- 
tems raises several constraints. Development time 
should be as short as possible; this encourages systematic
code re-use and implementation of the kernel by assembling
existing components. Kernel size should be
minimal; only services and concepts required by the
applications for which the kernel is targeted should be
embedded within the kernel. Efficiency should be targeted;
no specific hardware feature or low-level kernel functionality
should be masked from the application. Finally,
to provide flexibility, it must be possible to instantiate 
a kernel configuration at boot time and to dynamically 
download a new component into the kernel. To support 
these features, it should be possible to resolve the bind- 
ings between components at run time. 


Building flexible systems from components has been 
an active area of operating system research. Previous 
work includes micro-kernel architectures [2, 14, 25], 
where each component corresponds to a domain boundary
(i.e. server), extensible systems such as SPIN [3]
that support dynamic loading of components written in
a type-safe language, and more recently the OSKit [7]
or eCos [5], which allow the re-use of existing system
components. One problem with the existing component-
based approaches lies in the predefined and fixed ways
components can interact and be bound together. While
it is possible to create specific components to implement
a particular form of binding between components, there
is no supporting framework or tool to assist this development,
and little help in understanding how these different
forms of binding can be used and combined in a consistent
manner.


By binding we mean the end result of the general process 
of establishing communication or interaction channels 
between two or more objects. Bindings may take several
forms and range from simple pointers or in-address-
space references, to complex distributed channels established
between remote objects involving multiple layers
of communication protocols, marshalling and unmarshalling,
caching and consistency management, etc. As
argued in [27], the range and forms of bindings are so
varied that it is unlikely that a single generic binding
process or binding type can be devised. This in turn calls
for framework and tool support to help system developers
design and implement specific forms of binding.


The attention to flexible binding is not new. Several 
works, e.g. [4, 10, 27], have proposed flexible bind- 
ing models for distributed middleware platforms. The 
Nemesis operating system [13] introduces bindings for 
the handling of multimedia communication paths. The 
path abstraction in the Scout operating system [19] 
can be understood as an in-system binding abstraction. 
And channels in the NodeOS operating system interface 
for active routers [22] correspond to low-level packet- 
communication-oriented bindings. None of these works, 
however, has considered the use of a general model of 
component binding as a framework for building differ- 
ent operating system kernels. 


This paper 


This paper presents the THINK¹ framework for building
operating system kernels from components of arbi-
trary sizes. Each entity in a THINK kernel is a compo- 
nent. Components can be bound together in different 
ways, including remotely, through the use of bindings. 
Bindings are themselves assemblies of components that 
implement communication paths between one or more 
components. This structuring extends to the interaction 
with the hardware, which is encapsulated in Hardware 
Abstraction Layer (HAL) components. 


The contributions of this paper are as follows. We propose
a software architecture that enables operating system
kernels to be assembled, at boot-time or at run-time,
from a library of kernel components of arbitrary size.
The distinguishing feature of the framework is its flexible
binding model that allows components to be bound
and assembled in different and non-predefined ways.


We have designed and implemented KORTEX, a library
of kernel components that offers several models
of threads, memory management and process management
services. KORTEX implements different forms of
bindings, including basic forms such as system calls
(syscalls), up-calls, signals, IPC and RPC calls. We have
used KORTEX to implement L4-like kernel services. Our
benchmarks show excellent performance for low-level
system services, confirming that applying our component
model and our binding model does not result in degraded
performance compared to non component-based
kernels.


1 THINK stands for Think Is Not a Kernel


We have used KORTEX to implement operating system 
kernels for an active network router, the Kaffe Java vir- 
tual machine, and a Doom game. We have evaluated the 
performance of these kernels on a Macintosh/PowerPC 
machine. Our benchmarks show that our kernels are at
least as efficient as the implementations of these applications
on standard monolithic kernels. Additionally,
our kernels achieve small foot-prints. Finally, although 
anecdotal, our experience in using the THINK frame- 
work and the KORTEX library suggests interesting bene- 
fits in reducing the implementation time of an operating 
system kernel. 


The rest of the paper is structured as follows. Section 2 
discusses related work on component-based kernels and 
OSes. Section 3 details the THINK software framework, 
its basic concepts, and its implementation. Section 4 
describes the KORTEX library of THINK components. 
Section 5 presents several kernels that we have assem- 
bled to support specific applications and their evaluation. 
Section 6 assesses our results and concludes with future 
work. 


2 Related Work 


There have been several works in the past decade on 
flexible, extensible, and/or component-based operating
system kernels. Most of these systems, however, be
they research prototypes such as Choices and μ-Choices
[31], SPIN [3], Aegis/Xok [6], VINO [26], Pebble [8], 
Nemesis [13], 2K [12], or commercial systems such as 
QNX [23], VxWorks [34], or eCos [5], still define a 
particular, fixed set of core functions on which all of 
the extensions or components rely, and which implies 
in general a particular design for the associated family 
of kernels (e.g. with a fixed task or thread model, ad- 
dress space model, interrupt handling model, or commu- 
nication model). QNX and VxWorks provide optional 
modules that can be statically or dynamically linked to 
the operating system, but these modules rely on a basic 
kernel and are not designed according to a component-
based approach. eCos supports the static configuration
of components and packages of components into embedded
operating systems but relies on a predefined basic
kernel and does not provide dynamic reconfiguration
capabilities.


In contrast, the THINK framework does not impose a 
particular kernel design on the OS architect, who is free 
to choose e.g. between an exo-kernel, a micro-kernel 
or a classical kernel design, a single or multiple ad- 
dress space design. In this respect, the THINK approach 
is similar to that of OSKit [7], which provides a collection
of (relatively coarse-grained, COM-like) components
implementing typical OS functionalities. The
OSKit components have been used to implement sev- 
eral highly specialized OSes, such as implementations of 
the programming languages SML and Java at the hard- 
ware level [15]. OSKit components can be statically 
configured using the Knit tool [24]. The Knit compiler 
modifies the source code to replace calls across compo- 
nent boundaries by direct calls, thus enabling standard 
compiler optimizations. Unlike THINK, however, OSKit
does not provide a framework for binding components.
As a result, much of the common structure that
is provided by the THINK framework has to be hand-
coded in an ad-hoc fashion, hampering composition and
reuse. Besides, we have found in practice that OSKit
components are much too coarse-grained for building
small-footprint, specific kernels that impose no particular
task, scheduling or memory management model on
applications. Other differences between THINK and OSKit
include:


• Component model: THINK has adopted a component
model inspired by the standardized Open Distributed
Processing Reference Model (ODP) [1],
whereas OSKit has adopted the Microsoft COM component
model. While the two component models
yield similar run-time structures, and impose equally few
constraints on component implementations, we believe
that the THINK model, as described in section
3.1 below, provides more flexibility in dealing with
heterogeneous environments.


• Legacy code: OSKit provides several libraries
that encapsulate legacy code (e.g. from FreeBSD,
Linux, and Mach) and has devoted more attention
to issues surrounding the encapsulation of legacy
code. In contrast, most components in the KORTEX
library are native components, with the exception
of device drivers. However, techniques similar to
those used in OSKit (e.g. emulation of legacy environments
in glue code) could be easily leveraged
to incorporate coarse-grained legacy components
into KORTEX.


• Specialized frameworks: in contrast to OSKit,
the KORTEX library provides additional software
frameworks to help structure kernel functionality,
namely a resource management framework and a
communication framework. The resource management
framework is original, whereas the communication
framework is inspired by the x-kernel [11].


Other operating system-level component-based frameworks
include Click [18], Ensemble [15] and Scout [19].
These frameworks, however, are more specialized than
THINK or OSKit: Click targets the construction of modular
routers, Ensemble and Scout target the construction
of communication protocol stacks.


We thus believe that THINK is unique in its introduction
and systematic application of a flexible binding model 
for the design and implementation of component-based 
operating system kernels. The THINK component and 
binding models have been inspired by various works 
on distributed middleware, including the standardized 
ODP Reference Model [1], ANSA [10], and Jonathan 
[4]. In contrast to the latter works, THINK exploits flexi- 
ble binding to build operating system kernels rather than 
user-level middleware libraries. 


3 THINK Software Framework 


The THINK software framework is built around a small 
set of concepts, that are systematically applied to build 
a system. These concepts are: components, interfaces, 
bindings, names and domains. 


A THINK system, i.e. a system built using the THINK
software framework, is composed of a set of domains. 
Domains correspond to resource, protection and isola- 
tion boundaries. An operating system kernel executing 
in privileged processor mode and a set of user processes 
executing in unprivileged processor mode are examples 
of domains. A domain comprises a set of components. 
Components interact through bindings that connect their 
interfaces. Domains and bindings can themselves be 
reified as components, and can be built by composing 
lower-level components. The syscall bindings and remote
communication bindings described in section 4 are
examples of composite bindings, i.e. bindings composed
of lower-level components. Bindings can cross domain
boundaries and bind together interfaces that reside in
different domains. In particular, components that constitute
a composite binding may belong to different domains.
For example, the aforementioned syscall and remote
communication bindings cross domain boundaries.


3.1 Core software framework


The concepts of component and interface in the THINK 
framework are close to the concepts of object and in- 
terface in ODP. A component is a run-time structure 
that encapsulates data and behavior. An interface is a 
named interaction point of a component, that can be of 
a server kind (i.e. operations can be invoked on it) or of
a client kind (i.e. operations can be invoked from it). A 
component can have multiple interfaces. A component 
interacts with its environment, i.e. other components, 
only through its interfaces. All interfaces in THINK are 
strongly typed. In the current implementation of the 
THINK framework, interface types are defined using the 
Java language (see section 3.2). Assumptions about the 
interface type system are minimal: an interface type
documents the signatures of a finite set of operations, 
each operation signature containing an operation name, 
a set of arguments, a set of associated results (including 
possible exceptions); the set of interface types forms a 
lattice, ordered by a subtype relation, allowing multiple 
inheritance between interface types. The strong typing 
of interfaces provides a first level of safety in the assem- 
bly of component configurations: a binding can only be 
created between components if their interfaces are type
compatible (i.e. are subtypes of one another). 
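

A minimal Java sketch of such a type check follows. The interface
types Reader, Writer and ReadWriter and the class TypeCheckSketch are
hypothetical; only Top comes from Figure 1. The sketch also assumes one
particular reading of “type compatible”: the type offered by the server
interface must be a subtype of the type the client interface expects.

```java
/** Greatest element of the interface type lattice (as in Figure 1). */
interface Top { }

/** Hypothetical interface types used only for this illustration. */
interface Reader extends Top { int read(); }
interface Writer extends Top { void write(int value); }

/** Multiple inheritance between interface types, as the lattice allows. */
interface ReadWriter extends Reader, Writer { }

final class TypeCheckSketch {
    private TypeCheckSketch() { }

    /**
     * Returns true if a server interface of type `provided` may be bound
     * where a client interface of type `required` is expected, i.e. if
     * `provided` is a subtype of (or the same type as) `required`.
     */
    static boolean compatible(Class<? extends Top> required,
                              Class<? extends Top> provided) {
        return required.isAssignableFrom(provided);
    }
}
```

Under this reading a ReadWriter server can be bound to a client that
expects a Reader, but a plain Reader cannot satisfy a client that
expects a ReadWriter.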


An interface in the THINK framework is designated by
a name. Names are context-dependent, i.e. they are relative
to a given naming context. A naming context encompasses
a set of created names, a naming convention
and a name allocation policy. Naming contexts can be
organized in naming graphs. Nodes in a naming graph
are naming contexts or other components. An edge in
a naming graph is directed and links a naming context
to a component interface (which can be another naming
context). An edge in a naming graph is labelled by a
name: the name, in the naming context that is the edge
source, of the component interface that is the edge sink.
Given a naming graph, a naming context and a component
interface, the name of the component interface in
the given naming context can be understood as a path in
the naming graph leading from the naming context to the
component interface. Naming graphs can have arbitrary
forms and need not be organized as trees, allowing
new contexts to be added to a naming graph dynamically,
and different naming conventions to coexist (a crucial
requirement when dealing with highly heterogeneous
environments as may be the case with mobile devices).
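

The reading of a name as a path in a naming graph can be sketched as
follows. ContextSketch and its methods are hypothetical names chosen for
this example, and representing a path as a list of labels is an
assumption; the framework itself leaves the naming convention to each
naming context.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical naming context: a node whose outgoing edges carry names. */
final class ContextSketch {
    // Edge label -> edge sink (another context or a component interface).
    private final Map<String, Object> edges = new HashMap<>();

    /** Adds a directed edge labelled `name` from this context to `target`. */
    void link(String name, Object target) {
        edges.put(name, target);
    }

    /** Follows `path` edge by edge; returns null if a label is unknown. */
    Object resolve(List<String> path) {
        Object current = this;
        for (String label : path) {
            if (!(current instanceof ContextSketch)) {
                return null; // cannot descend through a non-context node
            }
            current = ((ContextSketch) current).edges.get(label);
            if (current == null) {
                return null;
            }
        }
        return current;
    }
}
```

Because edges are stored per context, the graph may contain cycles or
shared nodes, and a new context can be linked in at any time without
touching existing names.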


Interaction between components is only possible once
a binding has been established between some of their
interfaces. A binding is a communication channel between
two or more components. This notion covers both
language-level bindings (e.g. associations between language
symbols and memory addresses) as well as distributed
system bindings (e.g. RPC or transactional bindings
between clients and possibly replicated servers). In
the THINK framework, bindings are created by special
factory components called binding factories. A binding
typically embodies communication resources and implements
a particular communication semantics. Since several
binding factories may coexist in a given THINK system,
it is possible to interact with a component according
to various communication semantics (e.g. local or
remote; standard point-to-point at-most once operation
invocation; component invocation with monitoring, with
access control, with caching; event casting à la SPIN;
etc). Importantly, bindings can be created either implicitly,
e.g. as in standard distributed object systems such
as Java RMI and CORBA where the establishment of a
binding is hidden from the component using that binding,
or explicitly, i.e. by invocation of a binding factory.
Explicit bindings are required for certain classes
of applications such as multimedia or real-time applications,
that impose explicit, application-dependent quality
of service constraints on bindings. Creating a binding
explicitly results in the creation of a binding component,
i.e. a component that reifies a binding. A binding component
can in turn be monitored and controlled by other
components.
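

As an illustration of a reified binding, the sketch below builds a
binding component that forwards invocations to a server interface and
counts them, so that another component can monitor the binding. It is
not taken from THINK or KORTEX (which are written in C): the names
Counter and MonitoredBinding are hypothetical, and the use of the
standard java.lang.reflect.Proxy mechanism is simply a convenient way to
show the idea in Java.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

/** Marker supertype of all interface types, as in the paper's Figure 1. */
interface Top { }

/** Hypothetical server interface used only for this illustration. */
interface Counter extends Top { int next(); }

/** Hypothetical binding component: forwards invocations and counts them. */
final class MonitoredBinding implements InvocationHandler {
    private final Object target;
    private int invocations = 0;

    MonitoredBinding(Object target) {
        this.target = target;
    }

    /** Control interface of the binding: how many calls went through it. */
    int invocations() {
        return invocations;
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        invocations++;
        try {
            return method.invoke(target, args);
        } catch (InvocationTargetException e) {
            throw e.getCause(); // rethrow the server's own exception
        }
    }

    /** Returns a client-side reference of type `type` that uses this binding. */
    <T extends Top> T clientInterface(Class<T> type) {
        Object proxy = Proxy.newProxyInstance(
                type.getClassLoader(), new Class<?>[] { type }, this);
        return type.cast(proxy);
    }
}
```

The client holds only the reference returned by clientInterface, while
a separate monitoring component can hold the MonitoredBinding itself and
query it.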


interface Top { }
interface Name {
    NamingContext getNC();
    String toByte();
}
interface NamingContext {
    Name export(Top itf, char[] hint);
    Name byteToName(String name);
}
interface BindingFactory {
    Top bind(Name name, char[] hint);
}


Figure 1: Framework for interfaces, names and bindings 


These concepts of naming and binding are manifested
in the THINK software framework by the set of Java interface
declarations shown in Figure 1. The type Top
corresponds to the greatest element of the type lattice,
i.e. all interface types are a subtype of Top (all interface
types in THINK “extend” Top). The type Name
is the supertype of all names in THINK. The operation
getNC yields the naming context to which the name belongs
(i.e. the naming context in which the name has
been created through the export operation). The operation
toByte yields a simple serialized form of the
name instance.


The type NamingContext is the supertype of all naming
contexts in THINK. The operation export creates
a new name, which is associated to the interface passed
as a parameter (the hint parameter can be used to pass
additional information, such as type or binding data, required
to create a valid name). As a side-effect, this
operation may cause the creation of (part of) a binding
with the newly named interface (e.g. creating a server
socket in a standard distributed client-server setting).
The operation byteToName returns a name, upon receipt
of a serialized form for that name. This operation
is guaranteed to work only with serialized forms of
names previously exported from the same naming context.
The type NamingContext sets minimal requirements
for a naming context in the framework. More specific
forms of naming contexts can be introduced if necessary
as subtypes of NamingContext (e.g. adding a
resolve operation to traverse a naming graph).


The type BindingFactory is the supertype of all binding factories
in THINK. The operation bind creates a binding with the interface
referenced by the name passed as a parameter (the hint parameter
can be used to pass additional information required to establish
the binding, e.g. type or quality of service information). Actual
binding factories can typically add more specialized bind
operations, e.g. adding parameters to characterize the quality of
service required from the binding or returning specific
interfaces for controlling the newly constructed binding.
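The roles of Name, NamingContext and BindingFactory can be illustrated with a minimal C sketch of a purely local naming context. This is an illustration only: the names local_nc, local_name, local_nc_export, local_nc_byte_to_name and local_nc_bind, the fixed-size table, and the omission of the hint parameter are all assumptions made for this sketch and are not part of the THINK or KORTEX code.

```c
#include <stddef.h>

/* Hypothetical local naming context: a name is an index into a table. */
#define NC_CAPACITY 8

typedef struct {
    void  *itf[NC_CAPACITY];   /* exported interface references */
    size_t used;               /* number of names created so far */
} local_nc;

typedef struct {
    local_nc *nc;              /* context the name belongs to (cf. getNC) */
    size_t    index;           /* serialized form of the name (cf. toByte) */
} local_name;

/* export: create a new name associated with the interface itf.
 * Returns 0 on success, -1 if the context is full. */
static int local_nc_export(local_nc *nc, void *itf, local_name *out)
{
    if (nc->used == NC_CAPACITY)
        return -1;
    nc->itf[nc->used] = itf;
    out->nc = nc;
    out->index = nc->used++;
    return 0;
}

/* byteToName: rebuild a name from its serialized form. Only valid for
 * names previously exported from the same context. */
static int local_nc_byte_to_name(local_nc *nc, size_t bytes, local_name *out)
{
    if (bytes >= nc->used)
        return -1;
    out->nc = nc;
    out->index = bytes;
    return 0;
}

/* bind: here a binding is simply a pointer to the named interface. */
static void *local_nc_bind(const local_name *name)
{
    return name->nc->itf[name->index];
}
```

In this sketch the serialized form is only meaningful to the context that produced it, which mirrors the guarantee stated for byteToName.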


3.2 Implementing the THINK framework


In our current prototype, THINK components are written in C for
efficiency reasons. An interface is represented by an interface
descriptor structure, whose actual size and content are unknown
to the client, and which contains a pointer to the code
implementing the interface operations, as shown in Figure 2. This
layout is similar to a C++ virtual function table.


[Diagram labels: interface reference, interface descriptor,
interface methods, implementation]


Figure 2: Run-time interface representation


The exact location of private component data is the
responsibility of the component developer. Depending on the
nature of the target component, the implementation supports
several optimizations of the structure of the interface
representation. These optimizations help reduce, for instance,
memory and allocation costs when handling interface descriptors.
They are depicted in Figure 3. If the component is a singleton,
i.e. there is no other component in the given domain implementing
the same interface, then the interface descriptor and the
component private data can be statically allocated by the
compiler. If the component is not a singleton but has only one
interface, then the private data of the component can be
allocated directly with the interface descriptor. Finally, in the
general case, the interface descriptor is a dynamic structure
containing a pointer to the interface operations and an
additional pointer to the component's private data. In the
component library described in section 4, most components are
either singletons or have a single interface, and are implemented
accordingly.
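The descriptor layouts discussed above can be sketched in C as follows. The counter interface, the structure names and the helper call_next are hypothetical and chosen only for illustration; the paper does not give the actual THINK structure definitions. For uniformity the sketch keeps the data pointer even in the single-interface layout, whereas an actual optimization could drop it.

```c
#include <stddef.h>

/* Hypothetical operation table for a "counter" interface. Every
 * operation receives the interface reference as its first argument. */
struct counter_itf;
struct counter_ops {
    int (*next)(struct counter_itf *self);
};

/* General case: the descriptor holds a pointer to the operations and
 * a pointer to the component's private data. */
struct counter_itf {
    const struct counter_ops *ops;
    void *data;
};

/* Single-interface case: the private data is allocated together with
 * the descriptor, so one allocation covers both. */
struct counter_single {
    struct counter_itf itf;
    int value;
};

static int counter_next(struct counter_itf *self)
{
    int *value = self->data;
    return ++*value;
}

static const struct counter_ops counter_ops_impl = { counter_next };

static void counter_single_init(struct counter_single *c, int start)
{
    c->itf.ops = &counter_ops_impl;
    c->itf.data = &c->value;
    c->value = start;
}

/* Callers invoke operations the same way whatever the layout. */
static int call_next(struct counter_itf *itf)
{
    return itf->ops->next(itf);
}
```

The indirection through ops is what makes the layout comparable to a C++ virtual function table: the caller only needs the interface reference.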


[Diagram label: component with one interface]


Figure 3: Optimization on interface representation


This implementation abides by the C compiler ABI calling
conventions [35]. Thus, arguments in a PowerPC implementation are
passed on the stack and in registers to improve performance. It
is important to notice that all calls to a component interface
are expressed in the same way, regardless of the underlying
binding type and the location of the component. For example, a
server component in the local kernel domain is called in the same
way as a server component on a remote host; only the binding
changes².


3.3 Code generation and tools 


Building a particular kernel or an application using the 
THINK framework is aided by two main off-line tools. 


• An open interface compiler, that can be specialized to generate
  code from interface descriptions written in Java. For instance,
  it is used to generate C declarations and code that describe
  and produce interface descriptors, and to generate components
  (e.g. stub components) used by binding factories to create new
  bindings. This generated code can contain assembly code and
  exploit the specific features of the supporting hardware.

• An off-line configurator, that creates kernel images by
  assembling various component and binding libraries. This tool
  implicitly calls a linker (such as ld) and operates on a
  component graph specification, written in UML by kernel
  developers, which documents dependencies between components and
  component libraries. Dependencies handled by the configurator
  correspond to classical functional component dependencies
  resulting from provides and requires declarations (provides
  means that a component supports an interface of the given type,
  requires means that a component requires an interface of the
  given type to be present in its environment in order to operate
  correctly). An initialization scheduler, analogous to the
  OSKit's Knit tool [24], can be used to statically schedule
  component initialization (through calls to component
  constructors) at boot-time. The configurator also includes a
  visual tool to browse composition graphs.


²This does not mean that the client code need not be prepared to
handle the particular semantics associated with a binding, e.g.
handling exceptions thrown by a remote binding component in case
of communication failures.


Using the open interface compiler, interface descriptions written
in Java are mapped onto C declarations, where Java types are
mapped on C types. The set of C types which are the target of
this mapping constitutes a subset of possible C signatures.
However, we have not found this restriction to be an impediment³
for developing the KORTEX library.


Code generation takes place in two steps. The first step 
compiles interface descriptions written in Java into C 
declarations and code for interface descriptors. These 
are then linked with component implementation code. 
Binding components are also generated from interface 
descriptions but use a specific interface compiler (typ- 
ically, one per binding type), built using our open in- 
terface compiler. The second step assembles a kernel 
image in ELF binary format from the specification of a 
component graph. 
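The requires/provides dependencies recorded in a component graph determine a valid order in which component constructors can be called. The sketch below computes such an order with a depth-first topological sort. It is a generic illustration under assumed names (component_graph, needs, init_schedule); the paper does not describe the algorithm used by its initialization scheduler.

```c
#include <stddef.h>

#define MAX_COMPONENTS 16

/* Hypothetical component graph: needs[i][j] != 0 means component i
 * requires an interface provided by component j. */
typedef struct {
    size_t count;
    unsigned char needs[MAX_COMPONENTS][MAX_COMPONENTS];
} component_graph;

/* Depth-first visit: emit all providers of a component before the
 * component itself. state: 0 = unvisited, 1 = in progress, 2 = done. */
static int visit(const component_graph *g, size_t i,
                 unsigned char *state, size_t *order, size_t *n)
{
    if (state[i] == 2) return 0;
    if (state[i] == 1) return -1;          /* dependency cycle */
    state[i] = 1;
    for (size_t j = 0; j < g->count; j++)
        if (g->needs[i][j] && visit(g, j, state, order, n) != 0)
            return -1;
    state[i] = 2;
    order[(*n)++] = i;
    return 0;
}

/* Fill order[] with an initialization schedule. Returns 0 on success
 * or -1 if the graph contains a dependency cycle. */
static int init_schedule(const component_graph *g, size_t *order)
{
    unsigned char state[MAX_COMPONENTS] = {0};
    size_t n = 0;
    for (size_t i = 0; i < g->count; i++)
        if (visit(g, i, state, order, &n) != 0)
            return -1;
    return 0;
}
```

Because the schedule depends only on the static graph, it can be computed off-line, which is consistent with scheduling initialization statically at build time.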


During execution, a kernel can load a new component,
using the KORTEX dynamic linker/loader, or start a new 
application, by using the KORTEX application loader. 


³Note that, if necessary, it is always possible to specialize the
open interface compiler to map designated Java interface types
onto the required C types.


4 KORTEX, a component library


To simplify the development of operating system kernels and
applications using THINK, we have designed a library of services
that are commonly used in operating system construction. This
library, called KORTEX, is currently targeted for Apple Power
Macintoshes⁴. KORTEX currently comprises the following major
components:


• HAL components for the PowerPC that reify exceptions and the
  memory management unit.

• HAL components that encapsulate the Power Macintosh hardware
  devices and their drivers, including the PCI bus, the
  programmable interrupt controller, the IDE disk controller, the
  Ethernet network card (mace, bmac, gmac and Tulip), and the
  graphic card (frame-buffer).

• Memory components implementing various memory models, such as
  paged and flat memory.

• Thread and scheduler components implementing various scheduler
  policies, such as cooperative, round-robin and priority-based.

• Network components, architected according to the x-kernel
  communication framework, including Ethernet, ARP, IP, UDP, TCP
  and SunRPC protocols.

• File system components implementing the VFS API, including
  ext2FS and NFS.

• Service components that implement a dynamic linker/loader, an
  application loader and a small trader.

• Interaction components that provide different types of
  bindings.

• Components implementing part of the POSIX standard.


While many of these components are standard (for instance, the
thread and memory components have been directly inspired by the
L4 kernel [9]), several points about KORTEX are worth noting.
First, KORTEX systematically exploits the core THINK framework
presented above. In particular, the KORTEX interaction components
presented in section 4.5 all conform to the THINK binding model.
The diversity of interaction semantics already available is a
testimony to the versatility of this model. Second, KORTEX
remains faithful to the overall THINK philosophy, which is to not
impose specific design choices on the OS architect. This is
reflected in the fact that most KORTEX components are very
fine-grained, including interaction components. For instance,
syscall bindings (whose structure and semantics are typically
completely fixed in other approaches) are built as binding
components in KORTEX. Another example can be found with the HAL
components in KORTEX, which strictly reflect the capabilities of
the supporting hardware. Third, KORTEX provides additional
optional frameworks to help OS architects assemble specific
subsystems. KORTEX currently provides a resource management
framework and a communication framework. The former is applied
e.g. to implement the thread and scheduling components, while the
latter is applied to implement remote bindings. Finally, we have
striven in implementing KORTEX to minimize dependencies between
components. While this is more a practical than a design issue,
we have found in our experiments that fine-grained, highly
independent components facilitate comprehension and reuse, while
obviously yielding more efficient kernels, with smaller
footprints. This is an advantage compared to the current OSKit
library, for instance.


⁴The choice of PowerPC-based machines may seem anecdotal, but a
RISC machine does offer a more uniform environment for operating
system design.


4.1 HAL components for the PowerPC 


KORTEX provides HAL components for the PowerPC, 
including a HAL component for PowerPC exceptions 
and a HAL component for the PowerPC Memory Man- 
agement Unit (MMU). The operations supported by 
these components are purely functional and do not mod- 
ify the state of the processor, except on explicit demand. 


The KORTEX HAL components strictly manifest the capabilities of
the supporting hardware, and do not try to provide a first layer
of portability as is the case, e.g. with µChoices' nano-kernel
interface [31].


Exceptions 


The PowerPC exceptions HAL component supports a single interface,
which is shown in Figure 4. The goal of this interface is to
reify exceptions efficiently, without modifying their semantics.
In particular, note that, on the PowerPC, processing of
exceptions begins in supervisor mode with interrupts disabled,
thus preventing recursive exceptions.


interface Trap {
  void TrapRegister(int id, Handler handler);
  void TrapUnregister(int id);
  void TrapSetContext(int phyctx, Context virtctx);
  Context TrapGetContext();
  void TrapReturn();
}


Figure 4: Interface for PowerPC exception


When an exception id occurs, the processor invokes one of the
internal component methods TrapEnter_id. There is an instance of
this method in each exception vector table entry. These methods
first save the general registers, which form the minimal
execution context of the processor, at a location previously
specified by the system using the method TrapSetContext. This
location is specified by both its virtual and physical addresses,
because the PowerPC exceptions HAL component is not aware of the
memory model used by the system. TrapEnter_id also installs a
stack for use during the handling of the exception. A single
stack is sufficient for exception handling because the processor
disables interrupts during the handling of an exception. Next,
TrapEnter_id invokes the handler previously registered by the
system using the method TrapRegister. When this handler finishes,
it calls TrapReturn to restore the saved execution context. The
cost for entering and returning from an exception on a PowerPC G4
running at 500 MHz is shown in Table 1.


Operation      instructions   time (µs)   cycles
TrapEnter_id   57             0.160       80
TrapReturn     48             0.110       55
total          105            0.270       135


Table 1: Cost for handling an exception
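The time and cycle columns of Table 1 are related through the 500 MHz clock of the test machine: one cycle lasts 2 ns, so 0.160 µs corresponds to 80 cycles. The helper below is a small sketch of this conversion; the function name ns_to_cycles is an assumption introduced here.

```c
/* Convert a duration in nanoseconds to clock cycles at a given
 * frequency in MHz, rounding to the nearest cycle. */
static unsigned long ns_to_cycles(unsigned long ns, unsigned long mhz)
{
    return (ns * mhz + 500) / 1000;
}
```

For example, ns_to_cycles(160, 500) yields 80 and ns_to_cycles(110, 500) yields 55, matching the TrapEnter_id and TrapReturn rows.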





Although minimal, this interface provides enough func- 
tionality, e.g. to directly build a scheduler, as shown in 
section 4.4. The exceptions HAL component is com- 
pletely independent of the thread model implemented by 
the system that uses its service. 
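The register-then-dispatch pattern of the Trap interface can be sketched in portable C as follows. This is a simulation under stated assumptions: the table size, the function names (trap_register, trap_enter and so on) and the example handler count_handler are hypothetical, and the parts that require assembly on real hardware (saving registers, installing a stack, returning from the exception) are deliberately left out.

```c
#include <stddef.h>

#define TRAP_VECTORS 32          /* assumed table size, for illustration */

typedef void (*trap_handler)(int id, void *context);

static trap_handler trap_table[TRAP_VECTORS];
static void *trap_context;       /* where the execution context is saved */

/* Install a handler for exception id. Returns -1 on an invalid id. */
static int trap_register(int id, trap_handler handler)
{
    if (id < 0 || id >= TRAP_VECTORS) return -1;
    trap_table[id] = handler;
    return 0;
}

static int trap_unregister(int id)
{
    return trap_register(id, NULL);
}

static void trap_set_context(void *context) { trap_context = context; }
static void *trap_get_context(void) { return trap_context; }

/* Stand-in for TrapEnter_id: look up and invoke the registered
 * handler. Returns 0 if a handler ran, -1 otherwise. */
static int trap_enter(int id)
{
    if (id < 0 || id >= TRAP_VECTORS || trap_table[id] == NULL)
        return -1;
    trap_table[id](id, trap_context);
    return 0;
}

/* Example handler: counts how many times it has been invoked, using
 * the context pointer as a counter. */
static void count_handler(int id, void *context)
{
    (void)id;
    ++*(int *)context;
}
```

The component knows nothing about threads: the handler decides what the saved context means, which is why the same interface can serve different thread models.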


Memory Management Unit 


The PowerPC Memory Management Unit (MMU) HAL component implements
the software part of the PowerPC MMU algorithm. This component
can be omitted in appliances that need only flat memory. Figure 5
shows the interface exported by this component.


interface MMU {
  void MMUsetpagetable(int virt, int phys, int sz);
  void MMUaddmapping(int vsid, int virt,
                     int phys, int wimg, int pp);
  void MMUremovemapping(int vsid, int virt);
  PTE MMUgetmapping(int vsid, int virt);
  void MMUsetsegment(int vsid, int vbase);
  void MMUsetbat(int no, int virt, int phys,
                 int size, int wimg, int pp);
  void MMUremovebat(int no);
}


Figure 5: Interface for PowerPC MMU


The MMUsetpagetable method is used to specify the location of the
page table in memory. Since the PowerPC is a segmented machine,
the MMUsetsegment method is used to set the sixteen 256 MB
segments, thus providing a 4 GB virtual address space. The
MMUaddmapping, MMUremovemapping and MMUgetmapping methods add,
remove and obtain information about page translation.
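Sixteen segments of 256 MB each account for the 4 GB address space: the top 4 bits of a 32-bit address select the segment and the remaining 28 bits address bytes within it. The sketch below splits an address accordingly. The 4 KB page size comes from the 32-bit PowerPC architecture rather than from the text above, and the names ea_fields and ea_split are introduced here for illustration.

```c
#include <stdint.h>

/* 32-bit PowerPC effective address: 4-bit segment number, 16-bit page
 * index within the 256 MB segment, 12-bit offset within a 4 KB page. */
typedef struct {
    uint32_t segment;
    uint32_t page;
    uint32_t offset;
} ea_fields;

static ea_fields ea_split(uint32_t ea)
{
    ea_fields f;
    f.segment = ea >> 28;
    f.page    = (ea >> 12) & 0xFFFFu;
    f.offset  = ea & 0xFFFu;
    return f;
}
```

For instance, address 0x12345678 falls in segment 1, page 0x2345, at offset 0x678.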


The methods MMUsetbat and MMUremovebat reify the PowerPC Block
Address Translation (BAT) registers. These registers provide a
convenient way to build a single flat address space, such as can
be used in low-end appliances. The two main benefits are speed of
address translation and economy of page table memory use.


4.2 Resource management framework 


KORTEX provides a resource management framework
which can be applied to all resources in the system
at various levels of abstraction. The framework comprises
the resource and manager concepts as given in
Figure 6. A resource manager controls lower-level re- 
sources and uses them to construct higher-level ones. 
New resources (e.g. threads) can be created through op- 
eration create, whereas resource allocation is effected 
through the bind operation which creates a binding to 
a given resource. In other words, a resource is allocated 
to a component when a binding has been created by the 
resource manager between the component and the re- 
source. In this case, the hint parameter of the bind 
operation can contain managing information associated 
with the resource (e.g. scheduling parameters to be as- 
sociated with a thread). 


interface AbstractResource {
  void release();
}

interface ResourceManager extends BindingFactory {
  AbstractResource create(...);
}


Figure 6: Resource management framework


Several KORTEX components are architected according to the
resource framework: threads and schedulers, memory and memory
managers, network sessions (resources) and protocols (resource
managers).


4.3 Memory management components 


KORTEX provides memory management components 
that implement various memory models, such as paged 
memory and flat memory. A paged memory model can 
be used by systems that need multiple address spaces, 
for example to provide a process abstraction. The flat 
memory component can be used by systems that need
only a kernel address space, as can be the case e.g. in 
low-end appliances. KORTEX also provides a compo- 
nent that implements the standard C allocator. Compo- 
nents implementing the two memory models and the al- 
locator are described below. 


The flat memory components implement a single kernel address
space component that includes all of physical memory. This
address space is provided by using MMUsetbat exported by the MMU
HAL (see Section 4.1). This component supports an address space
interface providing methods to map and unmap memory in this
address space. The implementation of this component is
essentially void but the address space interface it supports is
useful to provide transparent access to memory for components,
such as drivers, that need to map memory and that can operate
similarly with either flat memory or paged memory.


Components providing the paged memory create, during
initialization, a page table in memory and an address
space for the kernel. An address space manager component
provides an interface for creating new address space
components. Address space components support inter- 
faces of the same type as that of the flat memory address 
space component. Physical memory page allocation is 
provided by a standard buddy system component. 
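A buddy system manages blocks of 2^k pages and merges a freed block with its "buddy", the neighbouring block of the same size. The two helpers below sketch the core arithmetic of the general technique. They are not taken from the KORTEX component: the names buddy_order and buddy_of and the 4 KB page size are assumptions for illustration.

```c
#include <stddef.h>

#define PAGE_SIZE 4096u   /* assumed page size */

/* Smallest order k such that 2^k pages hold at least `bytes` bytes. */
static unsigned buddy_order(size_t bytes)
{
    size_t pages = (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
    unsigned order = 0;
    while (((size_t)1 << order) < pages)
        order++;
    return order;
}

/* Page-frame number of the buddy of the block of 2^order pages that
 * starts at `pfn` (pfn must be aligned on 2^order pages). */
static size_t buddy_of(size_t pfn, unsigned order)
{
    return pfn ^ ((size_t)1 << order);
}
```

Flipping a single bit of the frame number is enough to find the buddy, which keeps both splitting and coalescing cheap.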


Finally, a dynamic memory allocator component provides the
implementation of the standard GNU memory allocator.


4.4 Thread and scheduler components


KORTEX provides three preemptive schedulers that provide
an interface of the same type: a cooperative
scheduler, a simple round-robin scheduler and a priority- 
based scheduler. They allow the usual operations on 
threads: creating and destroying threads, as well as al- 
lowing a thread to wait on a condition and to be notified. 
If threads are not running in the same address space, 
then the scheduler performs the necessary address space 
switch in addition to the thread context switch. 


These schedulers are implemented using the PowerPC exceptions HAL
component described in section 4.1. They can be implemented by
simply installing a timer interrupt handler, which on the PowerPC
is a handler for the decrementer exception. On a decrementer
exception, the handler uses TrapSetContext to replace the pointer
to the execution context of the current thread with a pointer to
the execution context of the newly scheduled thread. Due to the
simplicity of the HAL, these schedulers can be very efficient.
Table 2 presents context switching costs on a PowerPC G4 at
500 MHz. For example, a context switch between two threads in the
same address space costs 0.284 µs, and between two threads in
different address spaces costs 0.394 µs. This permits the use of
extremely small time slices, which can be useful e.g. for a
real-time kernel.
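The context-pointer swap described above can be sketched as a round-robin policy in C. The structure rr_sched, its saved field (standing in for the pointer that would be installed with TrapSetContext) and the function rr_tick are hypothetical names used only for this illustration; the real handler runs in exception context and is not shown.

```c
#include <stddef.h>

#define MAX_THREADS 4

typedef struct { unsigned long regs[32]; } context;   /* saved registers */

typedef struct {
    context  ctx[MAX_THREADS];
    size_t   count;     /* number of runnable threads (at least 1) */
    size_t   current;   /* index of the running thread */
    context *saved;     /* stand-in for the TrapSetContext pointer */
} rr_sched;

/* Timer handler: pick the next thread and redirect the saved-context
 * pointer, so that returning from the exception resumes that thread. */
static void rr_tick(rr_sched *s)
{
    s->current = (s->current + 1) % s->count;
    s->saved = &s->ctx[s->current];
}
```

No registers are copied by the scheduler itself: changing one pointer is enough, which is consistent with the low switching costs reported in Table 2.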


operation        instructions   time (µs)   cycles
thread switch    111            0.284       142
process switch   147            0.394       197


Table 2: Context switching costs


4.5 Interaction components 


KORTEX provides many different types of bindings be- 
tween components, which may be localized in differ- 
ent domains (e.g. the kernel, an application, or a remote 
host). 


Local binding 


This binding type is the simplest form of binding and is used for
interactions between components in the same domain. It is
implemented by a simple pointer to an interface descriptor.


Syscall binding 


This binding type can be used by systems that support
multiple address spaces to provide application isolation.
The syscall binding allows an application to use services
provided by the kernel. A syscall binding is implemented
using a client stub that performs a hardware
syscall instruction sc, thus triggering an exception. The 
syscall trap handler then calls the target interface com- 
ponent. The application can pass up to seven arguments 
in registers (r4 through r10) to the target. The remain- 
ing arguments, if any, must be passed in shared memory 
or on the user stack. 
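The split between register-passed and memory-passed arguments can be sketched as follows. The structure syscall_frame and the function syscall_marshal are hypothetical; they model only the bookkeeping a client stub would perform, not the sc instruction or the actual KORTEX stub code.

```c
#include <stddef.h>

#define SYSCALL_REG_ARGS 7          /* r4 through r10 */

typedef struct {
    unsigned long reg[SYSCALL_REG_ARGS];  /* values destined for r4..r10 */
    const unsigned long *spill;           /* arguments left in memory */
    size_t nreg;                          /* registers actually used */
    size_t nspill;                        /* number of spilled arguments */
} syscall_frame;

/* Place the first seven arguments in the register image and leave the
 * remainder in memory, where the callee can fetch them. */
static void syscall_marshal(const unsigned long *args, size_t n,
                            syscall_frame *f)
{
    f->nreg = n < SYSCALL_REG_ARGS ? n : SYSCALL_REG_ARGS;
    for (size_t i = 0; i < f->nreg; i++)
        f->reg[i] = args[i];
    f->nspill = n - f->nreg;
    f->spill = f->nspill ? args + f->nreg : NULL;
}
```

Calls with seven arguments or fewer never touch memory for argument passing, which is the common case the register convention is meant to favour.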


An optimization of the syscall binding can exploit the System V
ABI calling conventions [35]. Registers r1 and r14 to r31 are
non-volatile across method calls, so it is not necessary to save
them in the calling stub. The other registers (r0, r3 to r13) are
lost during method calls, and it is not necessary to save them
either. Obviously, this optimization assumes that the ABI calling
conventions are obeyed. It can save about 70 cycles per syscall.


Upcall and Signal binding 


The upcall and signal bindings allow the kernel to interact with
an application. A signal binding is used to propagate an
exception to the currently running application, while an upcall
binding is used to propagate an exception to an application
running in a different address space than the current one. Upcall
and signal bindings are very efficient because they merely invoke
a dedicated handler in the application. The binding first updates
the instruction and stack pointers, and then invokes the handler
in the application using the special instruction rfi. The
exception context is also propagated. This handler then calls the
target component interface, which is designated by its memory
address stored in the r3 register.


Because the exception context is propagated, the upcall 
binding is not completely secure: an upcalled compo- 
nent may never return, thus monopolizing the proces- 
sor. Several standard solutions can be used to build a 
secure upcall binding, for instance activating a timeout 
(unmasked prior to switching control to the upcalled address
space) or using shared memory and a yield mechanism to implement
a software interrupt.


Synchronous LRPC binding


An LRPC binding implements a simple synchronous in- 
teraction. It uses the syscall and upcall bindings. The 
syscall binding stub directly calls the upcall stub which
calls the target application component interface.


Remote RPC binding 


A remote binding implements a simple remote opera- 
tion invocation protocol, which provides transparent ac- 
cess to components on a remote host. The binding di- 
rectly builds and sends Ethernet packets, using the net- 
work protocol components. Although the binding is de- 
signed to work between kernels, it can support interac- 
tion between remote applications when combined with 
the syscall and upcall bindings. 


5 Evaluation 


In this section, we describe several experiments in 
assembling different operating system kernels using 
THINK. We have implemented a minimal extensible distributed
micro-kernel, a dedicated kernel for an active
router, one for a Java virtual machine, and another for 
running a DOOM game on a bare machine. 


All measurements given in this paper are performed on Apple Power
Macintoshes containing a PowerPC G4 running at 500 MHz (except
for the PlanP experiment, which has been done on a PowerPC G4 at
350 MHz), with 1 MB external cache and 128 MB memory. Network
cards used in our benchmarks are Asanté Fast PCI 100 Mbps cards
based on a Digital 21143 Tulip chip.


5.1 An extensible, distributed micro-kernel


We have built a minimal micro-kernel which uses the L4 address
space and thread models. Instead of L4 IPC, we used the KORTEX
LRPC binding. The resulting kernel size is about 16KB, which can
be compared with a 10KB to 15KB kernel size for L4 (note that L4
has been directly hand-coded in assembly language). Figure 7
depicts the component graph associated with this minimal
micro-kernel. The figure shows the relationships between
resources and resource managers, interfaces exported by
components, as well as language bindings and local bindings used
for combining the different components into a working kernel.


Table 3 summarizes the performance of synchronous bindings
provided by the KORTEX library. Each call has a single argument,
the this pointer, and returns an integer. An interaction via a
local binding takes 6 instructions (8 cycles). This shows that a
basic interaction between THINK components does not incur a
significant penalty. The KORTEX syscall binding takes 150 cycles,
which can be reduced to only 81 cycles when applying the
optimization described in section 4.5. By comparison, the Linux
2.4 syscall implementing getpid takes 217 cycles.


interaction         instructions   time (µs)   cycles
local               6              0.016       8
syscall             115            0.300       150
optimized syscall   50             0.162       81
signal              (illegible)    0.128       64
upcall              (illegible)    0.346       173
LRPC                (illegible)    0.630       315
optimized LRPC      (illegible)    0.490       245


Table 3: Performance of KORTEX bindings


Adding a dynamic loader component to this small micro-kernel
yields a dynamically extensible kernel, although one without
protection against faulty components and possible disruptions
caused by the introduction of new components. The size of this
extensible
kernel is about 160KB with all components, including 
drivers and managers needed for loading code from a 
disk. 


By adding remote RPC components to the extensible 
kernel, we obtain a minimal distributed system kernel,
which can call previously exported resources located on 
remote hosts. 


Figure 7: An example kernel configuration graph


Table 4 shows the costs of interaction through our remote RPC
binding. The table gives the time of completion of an operation
invocation on a remote component, with a null argument and an
integer result. The measurements were taken with an Ethernet
network at 10 Mbps and at 100 Mbps. A standard reference for
low-latency RPC communication on a high-speed network is the work
done by Thekkath et al. [32]. Compared to the 25 MHz processor
used in their test, a back-of-the-envelope computation⁵ would
indicate that our results are on a par with this earlier work⁶.
Furthermore, the KORTEX remote RPC binding can be compared with
various ORBs such as Java RMI. Here, the 40 microsecond
synchronous interaction performance (even adding the costs of
syscalls at both sites) should be compared with the typical
1 millisecond cost of a similar synchronous interaction.


⁵(500/25)*(11.3+4) = 306 microseconds at 10 Mbps, to be compared
with the 296 microseconds found in [32].

⁶Especially since the breakdown of the costs is consistent with
those reported in [32].


Network    Time (µs)   network   driver          marshall.
type                   link                      + null call
10baseT    180         164.7     11.3 (6.3%)     4
100baseT   40          24.7      11.3 (28.3%)    4


Table 4: Performance of synchronous remote binding


These figures tend to validate the fact that the THINK framework
does not preclude efficiency and can be used to build flexible,
yet efficient kernels.
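The figures of Table 4 and the back-of-the-envelope comparison with [32] can be checked with a few lines of arithmetic. The sketch below uses hypothetical names (rpc_cost, rpc_total, rpc_driver_share, scale_software_cost), and the linear scaling with clock frequency is the simplifying assumption made by the footnote itself.

```c
/* Components of a remote invocation, in microseconds (Table 4). */
typedef struct {
    double link;      /* time spent on the network link */
    double driver;    /* time spent in the network driver */
    double marshal;   /* marshalling plus null call */
} rpc_cost;

static double rpc_total(rpc_cost c)
{
    return c.link + c.driver + c.marshal;
}

/* Share of the total time spent in the driver, in percent. */
static double rpc_driver_share(rpc_cost c)
{
    return 100.0 * c.driver / rpc_total(c);
}

/* Scale a software cost measured on a from_mhz processor to a slower
 * or faster to_mhz processor, assuming cost is inversely proportional
 * to the clock frequency. */
static double scale_software_cost(double us, double from_mhz, double to_mhz)
{
    return us * from_mhz / to_mhz;
}
```

With link = 164.7, driver = 11.3 and marshal = 4, the total is 180 µs and the driver share is about 6.3%. Scaling the 15.3 µs of software cost from 500 MHz to 25 MHz gives 306 µs, the value quoted against the 296 µs of [32].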





5.2 PlanP 


PlanP [33] is a language for programming active net- 
work routers and bridges, which has been initially pro- 
totyped as an in-kernel Solaris module, and later ported 
to the Linux operating system (also as an in-kernel mod- 
ule). PlanP permits protocols to be expressed concisely 
in a high-level language, yet be implemented efficiently 
using a JIT compiler. While PlanP programs are somewhat slower
than comparable hand-coded C implementations, a network intensive
program such as an Ethernet learning bridge has the same
bandwidth in PlanP as in the equivalent C program. This suggests
that the Solaris and Linux kernels must be performance
bottlenecks⁷.


To show that we can get rid of this bottleneck in a THINK system,
we took as an example a learning bridge protocol Plearn,
programmed in PlanP, and we measured throughput on Solaris, Linux
and a dedicated kernel built with KORTEX. The configurations used
in our four experiments were as follows. In all experiments, the
hosts were connected via a 100 Mbps Ethernet network, and the two
client hosts were Apple Power Macintoshes containing a 500 MHz
PowerPC G4 with 256 MB of main memory and running the Linux
2.2.18 operating system. In the first experiment we measured the
throughput obtained with a null bridge, i.e. a direct connection
between the two client hosts. In the second experiment, the
bridge host was a 167 MHz Sun Ultra 1 Model 170s with 128 MB of
main memory running Solaris 5.5. In the third experiment, the
bridge host was an Apple Power Macintosh G4 350 MHz with 128 MB
of main memory running Linux 2.2.18. In the fourth experiment,
the bridge host was the same machine as in the third experiment
but running KORTEX. Throughput was measured using ttcp running on
the client hosts. Table 5 shows the throughput of the Plearn
PlanP program running on Solaris, Linux and KORTEX. As we can
see, using the KORTEX dedicated kernel increased the throughput
by more than 30% compared to Linux (from 65.5 Mbps for Linux to
87.6 Mbps for KORTEX).


⁷PlanP runs in the kernel in supervisor mode: there is no copy of
packets due to crossing kernel/user domain boundaries.


bridge                            throughput
none                              91.6 Mbps
PlanP/Solaris, Sparc 166 MHz      42.0 Mbps
PlanP/Linux, PowerPC 350 MHz      65.5 Mbps
PlanP/KORTEX, PowerPC 350 MHz     87.6 Mbps


Table 5: Performance of the THINK implementation versus Solaris
and Linux implementation


5.3 Kaffe 


Kaffe is a complete, fully compliant open source Java
environment. The Kaffe virtual machine was designed with
portability and scalability in mind. It requires threading,
memory management, native method interfacing and native system
calls. Kaffe was ported to a dedicated THINK kernel by mapping
all system dependencies to KORTEX components. For example,
exception management makes direct use of the exceptions HAL
component, whereas preemptive threads have been implemented on
both the priority-based scheduler, which provides native
thread-like semantics, and the cooperative scheduler, which
provides Java thread-like semantics. Thanks to our binding and
component framework, making this change requires no modification
in the threading code. Table 6 compares the performance of Kaffe
when running on Linux and when running on our dedicated kernel.
As we can see, exception management is better on the dedicated
kernel due to the reduced cost of signals, whereas native threads
perform as well as Java threads.


When porting a JVM, most of the time is spent in adapt- 
ing native methods. Thanks to the reuse of KORTEX 
components, implementing the Kaffe dedicated kernel
took one week. 


When executing standard Java applications with small 
memory needs, the memory footprint is 125KB for 
KORTEX components, plus 475KB for Kaffe virtual ma- 
chine, plus 1 MB for bytecode and dynamic memory, for 
a total of 1.6MB.


5.4 Doom 


An interesting experiment is to build a dedicated ker- 
nel that runs a video game (simulating e.g. the situation 
in a low-end appliance). To this end, we have ported 
the Linux Doom, version LxDoom [16], to THINK, us- 
ing the KORTEX flat memory component. The port took 
two days, which mainly consisted in understanding the 
graphic driver. The memory footprint for this kernel 
is only 95KB for KORTEX components, 900KB for the
Doom engine and 5MB for the game scenario (the WAD
file). 


The THINK implementation is between 3% and 6% 
faster than the same engine directly drawing on the 
frame-buffier and running on Linux starting in single user 
mode, as shown in table 7. Since there are no system 
calls during the test, and the game performs only compu- 
tation and memory copy, the diffierence is due to residual 
daemon activity in Linux and to the use of the flat mem- 
ory which avoids the use of the MMU. To pinpoint the 
cost of the latter, we have built the same application by 
simply switching to the use of the KORTEX paged mem- 
ory management component. As we can see, the use of 
the MMU adds about 2% on the global execution time. 
While the performance benefits are barely significant in 
this particular case, this scenario illustrates the potential 
benefits of the THINK approach in rapidly building opti-
mized, dedicated operating system kernels.


                 external resolution
                 320x200   640x480   1024x768
KORTEX (flat)      1955       491
KORTEX (MMU)       1914       485
Linux              1894       483


Table 7: Doom frames per second


Benchmark                      Kaffe/Linux   Kaffe/KORTEX    Kaffe/KORTEX
                                             (java-thread)   (native-thread)
synchronized(o) { }            0.527 µs      0.363 µs        0.363 µs
try {} catch(...) {}           1.790 µs      1.585 µs        1.594 µs
try {null.x()} catch(...) {}   12.031 µs     5.094 µs        5.059 µs
try {throw} catch(...) {}      3.441 µs      2.448 µs        2.434 µs
Thread.yield()                 6.960 µs      6.042 µs        6.258 µs


Table 6: Evaluation of the Kaffe dedicated THINK kernel


6 Assessment and Future Work 


We have presented a software framework for building
flexible operating system kernels from fine-grained com-
ponents and its associated tools, including a library of
commonly used kernel components. We have evaluated
our approach on a PowerPC architecture by implement-
ing components providing services functionally similar
to those implemented in the L4 kernel, and by assem-
bling specific kernels for several applications: an ac-
tive network router, a Java virtual machine, and a Doom
game. The micro-benchmarks (e.g. context switching
costs and binding costs) of our component-based micro-
kernel show a level of performance that indicates that,
thanks to our flexible binding model, building an op-
erating system kernel out of components need not suf-
fer from performance penalties. The application bench-
marks for our example dedicated kernels show improved
performance compared to monolithic kernels, together
with smaller footprints. We have also found that de-
veloping specific operating system kernels can be done
reasonably fast, thanks to our framework, component li-
brary, and tools, although our evidence in this area re-
mains purely anecdotal.


This encourages us to pursue our investigations with 
THINK. In particular, the following seem worth pursu- 
ing: 


• Investigating reconfiguration functions to support
run-time changes in bindings and components at
different levels in a kernel while maintaining the
overall integrity of the system.


• Investigating program specialisation techniques to
further improve performance, following the examples
of Ensemble [15] and Tempo [20].


• Developing other HAL components, in particular
for low-end appliances (e.g. PDAs), as well as
ARM-based and Intel-based machines.


• Developing a real-time OS component library and
exploiting it for the construction of an operating
system kernel dedicated to the execution of syn-
chronous programming languages such as Esterel
or Lustre.


• Exploiting existing OS libraries, such as OSKit,
and their tools, to enhance the KORTEX library
and provide a more complete development environ-
ment.


Availability 


The KORTEX source code is available free of charge for 
research purposes from the first two authors.


Ninja: A Framework for Network Services


J. Robert von Behren, Eric A. Brewer, Nikita Borisov, Michael Chen, Matt Welsh,
Josh MacDonald, Jeremy Lau, Steve Gribble* and David Culler
University of California at Berkeley 


Ninja is a new framework that makes it easy to cre- 
ate robust scalable Internet services. We introduce a 
new programming model based on the natural parallel- 
ism of large-scale services, and show how to implement 
the model. The first key aspect of the model is intelligent 
connection management, which enables high availabil- 
ity, load balancing, graceful degradation and online 
evolution. The second key aspect is support for shared 
persistent state that is automatically partitioned for 
scalability and replicated for fault tolerance. We discuss 
two versions of shared state, a cluster-based hash table 
with transparent replication and novel features that 
reduce lock contention, and a cluster-based file system 
that provides local transactions and cluster-wide
namespaces and replication. Using several applications
we show that the framework enables the creation of
scalable, highly available services with persistent data,
with very little application code, as little as one-tenth
the code size of comparable stand-alone applications.


1 Introduction 


The Ninja Project is focused on Internet infrastruc- 
ture and the need for a better way to create, maintain and 
operate robust giant-scale distributed systems. Although 
the overall project [GWv+01] addresses wide-area sys- 
tems, in this paper we study building robust large-scale 
centralized network services. Thus we focus on clusters 
within a single administrative domain that act as a cen-
tralized server for many users and potentially many ser-
vices. The primary goal is to deal in full with the word
‘robust’, which includes basic problems of scalability, 
availability, fault tolerance, and persistence. 

Network services include almost all aspects of large 
web sites, including many non-HTTP services, such as 
instant-messaging, e-mail and the central-server aspects 
of peer-to-peer file sharing. These services have a form 
of natural parallelism that derives from supporting mil- 
lions of independent users; we thus define scalability, 
concurrency and high availability in terms of users or 
requests. The basic unit of work is thus a query or con- 
nection (depending on the service) from a specific user. 

We believe that the framework presented here is the
right way to build these services: both that the program- 
ming model is the right way to think about the service, 
and that the mechanisms we use greatly simplify service 
authoring. In some sense, this framework is our fourth
version over a period of five years (starting with
[FGCB97]) and therefore represents considerable refine-
ment of both the model and mechanisms. Unfortunately,
it is nearly impossible to prove that a framework is
“right”; instead we focus on describing the principles
and invariants provided by the framework and why they
simplify service authoring, and we examine the code
size of several representative services and show that
they are remarkably small given that they are scalable,
highly available and persistent in the presence of faults.


*: Now at the University of Washington
This work was supported in part by DARPA #DABT 63-98-C-0038.


We explicitly do not look at those parts of a site built
on top of a database management system (DBMS) for 
several reasons. First, there is much work in industry on 
this topic and several products that work well. Second, 
our work is complementary to database research and 
would be easy to integrate with a DBMS by using Ninja 
as an “application server”. Third, we tend to focus on 
high availability, rather than transactions, and support a 
wider range of semantics than ACID [GR97]. However, 
we do look at persistence, replication, atomicity, and 
consistency, and many things done with a database are 
perhaps better done directly in Ninja (see Section 6). 
The requirements for network services are very 
demanding. By “robust” we mean all of the following: 


Scalability: the ability to support 100M users.


High Availability: the ability to answer queries 
nearly all of the time. Ninja services should be able 
to reach 4 or 5 nines, that is, the probability of
answering a query should be above 0.9999 (when 
desired). High availability means that most queries 
succeed and that if a query fails, retrying it has a 
high probability of success: ideally, retries should 
be independent trials. This differs from the harder 
goal of “fault tolerance” in which a query must 
complete correctly without a client-visible retry. 


Persistent Data: Like high availability, this is a 
specific form of fault tolerance: that data survives 
faults. This requires replication, and much of the 
framework will deal with automating replication for 
availability and persistence. There is often, but not
always, an implied sub-goal of consistency for the
data. We support a range of performance and
consistency tradeoffs, with the default being
linearizability [HW87].

Graceful Degradation: We cannot assume that there
will be sufficient resources to always handle the
offered load. Instead, we aim for graceful
degradation through admission control and
prioritization of requests. We aim to achieve the
maximum throughput even when overloaded.


Online Evolution: A variation of high availability, 
online evolution is the ability to upgrade the service 
in place without significant downtime. In most 
cases, we can upgrade a service without downtime. 
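
The retry argument under High Availability can be made concrete with a short calculation. The sketch below assumes, as the ideal case in the text does, that retries are independent trials with the same success probability; the function name is ours and purely illustrative.

```python
def failure_probability(p_success: float, retries: int) -> float:
    """Chance that the first attempt and every retry all fail.

    Assumes each attempt is an independent trial with the same
    success probability (the ideal case; an assumption, not a measurement).
    """
    if not 0.0 <= p_success <= 1.0 or retries < 0:
        raise ValueError("need 0 <= p_success <= 1 and retries >= 0")
    return (1.0 - p_success) ** (retries + 1)


# Four nines per attempt plus one client retry: the chance that both
# attempts fail is about 1e-8 under the independence assumption.
overall = 1.0 - failure_probability(0.9999, retries=1)
```

If failures are correlated (for example, a retry lands on the same failed node), the improvement is smaller, which is why the text calls independent trials the ideal.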


One primary goal is to make achieving these proper-
ties easy for service authors. We have developed several
example applications that exhibit robustness; we judge
ease of authorship primarily by code size. To achieve
ease of authorship, we employ three principles:


Exploiting Clusters: In a data-center environment, 
we can make many assumptions that are not true in 
general for distributed systems. These include a 
reliable source of power, temperature control, 
physical security, 24-hour monitoring, and a 
partition-free internal network. 


Programming Model: We believe that a fully 
general programming model makes it impossible to 
provide robust services. Instead, we use namespaces 
and narrow interfaces to control the sharing, 
replication, and persistence of data, which means 
that we do not have to provide these properties for 
all data at all times. Second, we forsake general 
multi-threaded concurrency for a specific style that 
matches the natural parallelism. We thus focus on 
inter-task parallelism rather than intra-task
parallelism, although we support asynchronous I/O.
We show that this model is sufficiently expressive 
to write a wide variety of services. 


Hide Complexity: We share with DBMS research the 
goal of hiding the complex details of replication, 
persistence, load balancing and fault tolerance from 
applications. However, we do so through the use of 
reusable data structures and libraries rather than via 
an abstract data model and a declarative language 
(SQL). Both approaches enable strong properties 
with relatively little application code, but our 
approach fits more naturally with applications 
written in imperative languages such as C or Java. 


We define the programming model in Section 2, and 
our key mechanisms in Section 3. Section 4 describes 
the applications and Section 5 presents their evaluation. 
Section 6 discusses our principles and related and future 
work, and Section 7 provides a summary. 


2 Programming Model 


Figure 1: The Programming Model

A node consists of threads, local state and shared
state. Nodes use the same “program” (code base),
and receive connections from the Connection
Manager in the style of data parallelism.


The goal of the programming model is to simplify
the creation of complex network services; such services
must map naturally onto the programming model. Sec-
ond, the model must enable an underlying implementa-
tion that hides the details of fault and load management,
scalability, high availability, and online evolution.
Given the natural parallelism described above, we 
choose a model based on request parallelism, in which 
we aim to partition users’ request streams across nodes. 
For ease of authoring, we would like to have a single 
program that is automatically spread across the cluster. 
Thus we choose to base our model on the single-pro-
gram-multiple-data (SPMD) model commonly used in
parallel computing [DGNP88]. We make two extensions
to the SPMD model: support for shared state¹ and man-
agement of connections to the outside world. We refer
to this new model as the single-program-multiple-con-
nection (SPMC) model, as shown in Figure 1. We also
assume many threads per node, which differs from 
SPMD in practice, but not in definition. There are 
expected to be many connections per node, and there 
may be more or fewer threads than connections. One big 
practical difference of course is that we seek to achieve 
high availability and tolerance for partial failures, 
whereas SPMD was developed in the context of the all- 
or-nothing fault models of parallel machines. 

By “shared state” we mean that the threads and con-
nections active on any subset of nodes may share global
namespaces that support linearizable updates (i.e.
strongly consistent, see [HW87]) to network-accessible
storage in a uniform manner across the cluster. This
notion does not require shared memory, as assumed in
the original SPMD work; instead we provide multiple
global namespaces accessed via method calls rather than
load/store instructions. This narrow interface to shared
state simplifies consistency, replication and persistence.


1: Although many SPMD systems had at least a shared namespace
(e.g. CM-5 [HT93] and T3E [Sco96]), support was inconsistent and
we thus treat this as an extension.

Conversely, we define non-shared state as “local 
state”, which includes local memory and files. Local
state is simpler and faster to access than shared state,
and is useful for session and thread state (e.g., stacks), 
caching shared state, and temporary files. 


Invariant: Shared state is strongly consistent across 
all nodes used by a service. 


Globally named shared state implies that any con- 
nection can be serviced from any node, which is a tre- 
mendous simplification. Shared state enables simple 
construction of groupware services, communication ser- 
vices (e.g. chat), and information dispersion (e.g. stock 
quotes); it also simplifies service management. For 
example, it is easy to count users, put them in (shared)
queues, and track aggregate statistics about the service. 

Shared state can also be highly available and persis- 
tent (up to some configurable number of faults): 


Invariant: Shared state is highly available and persis-
tent (when desired).


High availability requires automatic replication, while 
durability requires management of replicas and disk 
writes. The power and simplicity of the SPMC model 
come from the automatic management of consistent, 
highly available, persistent shared state. Finally, we 
allow services to relax these invariants for better perfor-
mance (Sections 3.4.1 and 5.5).

We support multiple independent namespaces for 
shared data. This provides both encapsulation and logi- 
cal isolation of different services and system compo- 
nents, thus providing a simple form of security: services 
cannot read or write the shared state of other services or 
of shared system components. Additionally, support for 
many namespaces enables fine-grain control over con- 
sistency, availability (replication), and persistence, all of 
which are attributes of a namespace. 

We have found two kinds of shared state to be par- 
ticularly useful: shared data structures and shared files: 


Shared Data Structures: These have the same 
interface as normal data structures and are therefore 
very easy to use. We provide hash tables and B- 
trees. Hash tables are sufficient to support other 
models including tuple spaces and shared arrays. 
We also provide extensible atomic operations that 
enable programmers to create high-concurrency 
sharing primitives, such as compare-and-swap. 


Shared Files: Although we can store large items
persistently in the hash table, we found that file
usage is sufficiently different and common to
support directly. Some of the key differences
include larger objects, larger working set sizes,
lower expectation of being in memory, and the need
for data streaming over the network.
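
To illustrate the "narrow interface" idea, here is a minimal single-process sketch of a shared hash table that is reached only through method calls and carries its namespace attributes with it. The class and method names (SharedHashTable, compare_and_swap, increment) are hypothetical and not Ninja's API, and a local lock stands in for the cluster-wide consistency protocol.

```python
import threading
from typing import Any, Dict, Optional


class SharedHashTable:
    """In-process stand-in for one shared namespace (hypothetical API)."""

    def __init__(self, name: str, replicas: int = 2, persistent: bool = True):
        self.name = name
        self.replicas = replicas      # availability is a namespace attribute
        self.persistent = persistent  # so is persistence
        self._data: Dict[str, Any] = {}
        self._lock = threading.Lock()  # stands in for the consistency protocol

    def get(self, key: str) -> Optional[Any]:
        with self._lock:
            return self._data.get(key)

    def put(self, key: str, value: Any) -> None:
        with self._lock:
            self._data[key] = value

    def compare_and_swap(self, key: str, expected: Any, new: Any) -> bool:
        """Atomically replace the value only if it still equals `expected`."""
        with self._lock:
            if self._data.get(key) != expected:
                return False
            self._data[key] = new
            return True


def increment(table: SharedHashTable, key: str) -> int:
    """Count something (e.g. users) with a compare-and-swap retry loop."""
    while True:
        current = table.get(key)
        updated = (current or 0) + 1
        if table.compare_and_swap(key, current, updated):
            return updated
```

A service would call increment(table, "users") on each login; because every access goes through these few methods, replication and persistence can be handled behind them without changing the service code.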


The second extension in SPMC is explicit support
for connection management. The SPMD model does not 
define how outside I/O interacts with the nodes, except 
for possibly spreading files across the nodes. For net- 
work services the problem is much more dynamic: some 
state is long lived, and we must isolate down nodes from 
the clients to provide high availability. 


Invariant: New or retried connections arrive at
“up” nodes.


Note that we do not promise that connections do not go 
down: existing connections are lost when a node goes 
down. Although possible in theory, moving active con- 
nections when the server side dies is not practical. For 
example, every potentially client visible state change 
must be durable, which requires tracking those changes 
to either a replica or a persistent store, as they occur. 
Instead, we promise that retried connections are not 
affected by the failure. Using shared state, it is possible 
to keep session state across this transition as needed, 
which is simpler and much more tractable than tracking 
all client-visible state automatically. 

To enable more intelligent connection management, 
we add one key idea to the programming model: con- 
nections are partitioned into application-defined classes, 
which we call partitions. By default a service has only 
one class, in which case all connections are treated the 
same, but in practice explicit partitioning gives the 
author more control over the service. In particular, a
partition is the:


Unit of Affinity: Connections in the same partition go
to the same node(s), which enables cache affinity
(similar to LARD [PAB+98]), and reduces
communication for users within the partition. For
example, if all users in a chat room are in the same
partition, then the group state resides on that node.²


Unit of Priority: Partitions allow the author to 
control graceful degradation and quality of service. 
In particular, we can support application-defined 
admission control, by dropping connections in low- 
priority partitions first. The same idea enables 
differentiated quality of service by partition: we can 
support different densities of users/node for high- 
and low-priority partitions. For example, high- 
paying stock traders might have less congestion and 
thus faster trades, especially during overload. 


Unit of Migration: Under a load imbalance or a fault,
it may be necessary to migrate users’ state to a new
node. Partitions are the unit of migration for fault
recovery and load balancing. This ability also
simplifies online evolution, as we can do a rolling
upgrade by partition.


2: Some communication is still required if the chat state is repli-
cated, but typically chat rooms are neither persistent nor highly avail-
able; the application code would be almost identical regardless.


In general, explicit partitions are powerful because
we get simple application-level guidance on how to
group connections. We can then use these groups to pro- 
vide fine-grain control over replication, cache affinity, 
quality of service, graceful degradation and online evo- 
lution. Note that we partition connections and not users; 
we can use them to partition users or we can have the 
same user in different partitions simultaneously depend- 
ing on the task. Finally, service authors can ignore parti- 
tions if they need only even load balancing. 
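
The affinity and migration roles of a partition can be sketched in a few lines. This is an illustration under our own assumptions (one node per partition, placement on the node currently holding the fewest partitions); the PartitionMap name and its methods are hypothetical.

```python
from typing import Dict, List


class PartitionMap:
    """Maps each partition to one live node (hypothetical sketch)."""

    def __init__(self, nodes: List[str]):
        self.nodes = list(nodes)
        self.assignment: Dict[str, str] = {}  # partition -> node

    def _load(self, node: str) -> int:
        # Number of partitions currently placed on this node.
        return sum(1 for n in self.assignment.values() if n == node)

    def node_for(self, partition: str) -> str:
        """Affinity: the same partition always resolves to the same node."""
        if partition not in self.assignment:
            if not self.nodes:
                raise RuntimeError("no live nodes")
            self.assignment[partition] = min(self.nodes, key=self._load)
        return self.assignment[partition]

    def node_failed(self, node: str) -> List[str]:
        """Migration: move every partition of a dead node to a live one."""
        self.nodes.remove(node)
        moved = [p for p, n in self.assignment.items() if n == node]
        for p in moved:
            del self.assignment[p]
            self.node_for(p)  # re-place on the least-loaded live node
        return moved
```

For example, with nodes n1 and n2, all connections for "room-a" go to one node until that node fails, at which point the whole partition moves as a unit.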


3 Mechanisms 


In this section we examine the four key building 
blocks that we use to achieve the SPMC model. 


3.1 Clone Groups 

The first mechanism is to virtualize the SPMC
model: instead of each service running on a whole clus- 
ter, we instead run services on clone groups, which are a 
set of clones with common code and state. A clone is a 
virtual node that we map onto a real node dynamically; 
we refer to them as clones because they share the same
code base (the “single program”) and shared state. Thus
when we discuss shared state or namespaces, it is
always for a specific clone group. Similarly, connec- 
tions are managed across clone groups, not the cluster. 


Principle: Clone groups provide each service with a 
virtual cluster. 


Clone groups typically map onto a subset of the real 
nodes, and may vary in size depending on load. More 
than one clone group may map onto a node, in which 
case they are isolated in terms of state and namespaces, 
but not in performance. However, the connection man- 
ager described below can maintain even load balancing 
even if clones have uneven throughput due to differ- 
ences in hardware, work per connection, or interference 
from other groups. 

Clone groups provide several useful mechanisms to 
the programmer, including membership, broadcast and
barrier synchronization. Changes in membership lead to 
notification of all clones via birth and death events. 
Membership is approximate and eventually consistent, 
which has proven sufficient in practice. For example, 
we use death notification to instigate recovery within 
the shared data structures. 

Broadcast is mostly useful for notification, since
there is a better mechanism for sharing state. Barriers
could be implemented on top of shared state, but are
actually done via message passing (i.e. events) because
of the need to integrate dynamic membership informa-
tion. A barrier is considered done when all live nodes
reach the barrier, so death events may complete a bar-
rier. As with SPMD, barriers are used to ensure that all
clones are in the same stage; the biggest use seems to be
to denote the completion of an initialization phase.
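
The barrier rule above (done when all live nodes have arrived, so that a death event may complete it) can be sketched as follows. This is a single-process illustration with hypothetical names, not Ninja's message-passing implementation.

```python
from typing import Iterable, Set


class MembershipBarrier:
    """Barrier that completes when every live clone has arrived."""

    def __init__(self, members: Iterable[str]):
        self.live: Set[str] = set(members)
        self.arrived: Set[str] = set()

    def is_done(self) -> bool:
        # Done when the live set is contained in the arrived set.
        return self.live <= self.arrived

    def arrive(self, clone: str) -> bool:
        """Record an arrival event; returns whether the barrier is done."""
        if clone in self.live:
            self.arrived.add(clone)
        return self.is_done()

    def death(self, clone: str) -> bool:
        """Record a death event; a death can complete the barrier."""
        self.live.discard(clone)
        self.arrived.discard(clone)
        return self.is_done()
```

With clones a, b and c, the barrier stays open after a and b arrive, and completes as soon as a death event for c is delivered.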

An overall manager, called the “shogun”, dynami-
cally modifies the size of each service’s virtual cluster
based on utilization. Remarkably, most services don’t
care about the size of the cluster, since the shared state is
managed across the transition automatically, and no 
connections are lost during the transition (see Section 
5.4). A service can track changes using the birth/death 
events when needed. 

Typically, replication uses subsets of a clone group 
of storage nodes. A replica group is thus a subset of a 
clone group that handles replication for part of the 
shared data, so that we can decouple the degree of repli- 
cation from the number of clones. Replicas use the 
clone-group mechanisms to handle replica membership 
and synchronization. We use many small replica groups 
in one storage clone group, with overlapping member-
ship. For example, with 2-way replication, a replica
group is a two-node subset of a larger storage clone 
group. The use of lots of small groups reduces the 
recovery latency per group, and enables incremental 
recovery, where each small group is one step. By 
design, the groups are small enough that we can just
copy the whole contents of another replica atomically, 
without too much concern for the fact that we prevent 
updates to that group (only) during the copy. 
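
One way to picture "many small replica groups with overlapping membership" is to arrange the storage clones in a ring and let each group consist of neighbouring clones. The text does not say how groups are chosen, so the ring layout, the hash-based key placement and the function names below are all our assumptions.

```python
import hashlib
from typing import List, Sequence, Tuple


def replica_groups(nodes: Sequence[str], degree: int = 2) -> List[Tuple[str, ...]]:
    """Build one group per node: the node plus its next (degree - 1) neighbours."""
    n = len(nodes)
    if degree < 1 or degree > n:
        raise ValueError("need 1 <= degree <= number of nodes")
    return [tuple(nodes[(i + j) % n] for j in range(degree)) for i in range(n)]


def group_for_key(key: str, groups: Sequence[Tuple[str, ...]]) -> Tuple[str, ...]:
    """Pick the replica group responsible for a key using a stable hash."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return groups[int.from_bytes(digest[:4], "big") % len(groups)]


groups = replica_groups(["s0", "s1", "s2", "s3"], degree=2)
# groups == [("s0", "s1"), ("s1", "s2"), ("s2", "s3"), ("s3", "s0")]
```

Each node belongs to two groups here, so when one node is replaced only the small groups that contained it need to be recopied, one group at a time.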


3.2 Single-node Run-Time System: SEDA 

An important aspect of building scalable services is
to support very high concurrency and to avoid overcom- 
mitment of server resources. Building highly concurrent 
systems is inherently difficult: structuring code to 
achieve high throughput is not well-supported by exist- 
ing programming models, and traditional concurrency 
mechanisms, particularly threads, make it difficult for 
applications to exercise control over their resource 
usage. 

Ninja makes use of a concurrency design called
SEDA, or staged event-driven architecture. Services are
structured as a set of stages connected by explicit event
queues. This design permits each stage to be individu-
ally conditioned to load (e.g., by performing threshold-
ing on its incoming event queue), and facilitates
modular application construction. SEDA, covered in
detail in [WCB01], enables not only very high concur-
rency, but also graceful degradation through resource
management and adaptive load shedding.

For the purposes of this paper, SEDA provides two
key capabilities: support for more connections/node and
thus better overall performance, and detection of over-
load at the node level, which we need to provide grace-
ful degradation for the overall service.

The scalability limits of threads are well-known and 
have been studied in several contexts, including Internet 
services [PDZ99] and parallel computing [RV89]. Gener-
ally, as the number of threads grows, OS overhead
(scheduling and aggregate memory footprint) increases, 
which leads to a decrease in overall performance. Direct 
use of threads presents several other correctness and 
tuning challenges, including race conditions and lock 
contention. 


Principle: Concurrency is implicit in the program-
ming model; threads are managed by the
runtime system.


Since we must avoid excess threads to achieve 
graceful degradation, we simply prevent services from 
creating threads directly. Instead, services only define 
what could be concurrent, via (explicitly) concurrent 
stages. Conceptually, each stage has a dedicated but 
bounded thread pool, but the allocation and scheduling 
of threads is handled by SEDA. Thus the system as a 
whole is event-driven, but stages may block internally 
(for example, by invoking a library routine or blocking 
I/O call), and use multiple threads for concurrency. The
size of the stage’s thread pool must be balanced between 
obtaining sufficient concurrency and limiting the total 
number of threads; SEDA uses a feedback loop to man- 
age thread pools automatically. The particular policies 
are beyond the scope of this paper, as they only affect
the node performance. Roughly, allocation is based on 
effective use of threads (non idle) and priorities, while 
scheduling is based on queue size and tries to batch 
tasks for better locality and amortization (as in [Lar00]).

Internal framework modules, such as the shared-
state mechanisms in Section 3.4, also use stages and
avoid explicit thread creation. The internal modules are
often written in the event-driven style, common for
high-performance servers [PDZ99], which we enable by
providing non-blocking interfaces for all network and 
disk activity, and for the shared data structures. 


Invariant: Overload detection is automatic; services
are notified when they are overloaded.


A key property of queues is that it becomes possible 
to implement backpressure by thresholding the event 
queue for a stage. We use this to detect overloaded 
stages and thus to initiate overload mode and graceful
degradation. With only the implicit queues of blocked
threads, it is difficult to detect overload until too late.

Thus our single-node runtime system provides two
key capabilities. First, it provides control over thread
allocation and scheduling, which enables either thread- 
based or event-driven programming and ensures thread 
limits consistent with the operating range of the node. 
Second, it provides backpressure via explicit queues that 
enables the detection of overload and thus graceful deg- 
radation, which is shown in Section 5.3. 
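
The backpressure idea in this section can be shown with a stage whose event queue has a threshold. The sketch below is single-threaded for clarity and uses hypothetical names (Stage, enqueue, run_once); real SEDA stages have managed thread pools, which are omitted here.

```python
from collections import deque
from typing import Any, Callable, Deque, List


class Stage:
    """One stage with an explicit, thresholded event queue (sketch)."""

    def __init__(self, name: str, handler: Callable[[Any], Any], threshold: int):
        self.name = name
        self.handler = handler
        self.threshold = threshold
        self.queue: Deque[Any] = deque()
        self.overloaded = False

    def enqueue(self, event: Any) -> bool:
        """Accept an event, or reject it and flag overload at the threshold."""
        if len(self.queue) >= self.threshold:
            self.overloaded = True  # explicit queue makes overload visible
            return False
        self.queue.append(event)
        return True

    def run_once(self, batch: int = 1) -> List[Any]:
        """Process up to `batch` queued events and clear the overload flag
        once the queue has drained below the threshold."""
        results = []
        for _ in range(min(batch, len(self.queue))):
            results.append(self.handler(self.queue.popleft()))
        if len(self.queue) < self.threshold:
            self.overloaded = False
        return results
```

A rejected enqueue is the signal an upstream stage, or the node as a whole, can use to start shedding load; with blocked threads alone there is no such explicit signal.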


3.3 Connection Manager (CM) 

The Connection Manager (CM) is responsible for all 
external names. It dynamically maps external names to 
clone groups and connections to an external name to a 
specific clone. It must hide failed nodes and balance 
load across the clone group. Although it appears in the 
figure as a single point of failure, it is actually a pair of 
“layer 7” switches [Fou01] that provide automatic
failover for each other. Based on Ethernet switch reli-
ability, we estimate the uptime of these switches at
about 1 - 10^-7 each (seven 9’s), so the pair is extremely
reliable.
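
As a rough check on that estimate, suppose each switch is unavailable with probability $10^{-7}$ (seven 9's of uptime) and that the two switches fail independently. Independence is our assumption; correlated failures and failover time are ignored. The pair is then unavailable only when both switches are down:

```latex
\[
  P(\text{pair down}) = \left(10^{-7}\right)^{2} = 10^{-14},
  \qquad
  P(\text{pair up}) = 1 - 10^{-14}.
\]
```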

As an optimization, services can define partitions 
that are subgroups of names (and thus connections), to 
provide fine-grain control over resource allocation and 
graceful degradation. The CM can map partitions to 
subsets of clones in a clone group, or in the case of 
admission control deny partitions altogether. 


3.3.1 External Names 

The connection manager provides a level of indirec- 
tion for external service names. The CM maps external 
names to clone groups, which may change dynamically, 
and load balances connections among the clones. In 
general, the CM maps external (IP, port) pairs to the set 
of internal pairs corresponding to the clone group. When 
there are multiple clones, connections are balanced 
across the target set based on open connections. 

The CM tracks clone birth and death events in order 
to maintain high availability. Starting a clone is a two- 
step process. During initialization, Ninja allocates 
server sockets for a clone and starts it. It then registers 
the clone with the CM, which starts to forward connec-
tions to the clone. Stopping a clone is the reverse pro-
cess: the CM stops forwarding connections to the clone
and then removes it from the clone group. To provide 
higher availability, the clone may finish processing out- 
standing requests before it exits. 
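
The name mapping, load balancing and clone registration steps above can be sketched together. This is a toy model with hypothetical names (ConnectionManager, register, deregister); the real CM is a pair of layer-7 switches, and the addresses in the usage note are made up.

```python
from typing import Dict, Tuple

Addr = Tuple[str, int]  # (IP, port)


class ConnectionManager:
    """Maps an external (IP, port) to clones and balances on open connections."""

    def __init__(self) -> None:
        # external address -> {internal clone address: open connection count}
        self.groups: Dict[Addr, Dict[Addr, int]] = {}

    def register(self, external: Addr, clone: Addr) -> None:
        """Second step of starting a clone: begin forwarding to it."""
        self.groups.setdefault(external, {})[clone] = 0

    def deregister(self, external: Addr, clone: Addr) -> None:
        """First step of stopping a clone: stop forwarding new connections."""
        self.groups.get(external, {}).pop(clone, None)

    def connect(self, external: Addr) -> Addr:
        """Route a new connection to the clone with the fewest open ones."""
        clones = self.groups.get(external)
        if not clones:
            raise LookupError("no clone registered for %r" % (external,))
        target = min(clones, key=clones.get)
        clones[target] += 1
        return target

    def close(self, external: Addr, clone: Addr) -> None:
        """Record that a connection to a clone has ended."""
        clones = self.groups.get(external, {})
        if clones.get(clone, 0) > 0:
            clones[clone] -= 1
```

After deregister, a clone receives no new connections but can still finish the ones it has, which is what allows it to shut down gracefully.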


Invariant: Ninja can remap or resize clone groups 
without dropping connections. 


The ability to shut down clones gracefully makes it pos-
sible for Ninja to remap clones to nodes dynamically or
to reduce the size of an underutilized clone group. The
same ability enables online evolution to a new version
with no downtime (shown in Section 5.4).


Figure 2: Delivering different Qualities of Service

A three-node web server is divided into two
partitions: Partition 1 maps onto one node, Partition 2
onto two. The CM achieves twice the asymptotic
throughput and better response time for the larger
partition, and both tiers show graceful degradation.


3.3.2 Partitions 

In the current implementation, services can define 
partitions in two ways, either via ports or URL string 
matching. With ports, services map external port num- 
bers to partitions, which are then dynamically mapped 
to clones. Typically, services would use one external 
port number per partition, although more are allowed. 
For HTTP requests, the service can define partitions 
based on URL hashing and string matching; we cur- 
rently support prefix, suffix, and substring matching. 
This can be done at wire speed using current “layer 7”
switches [Fou01]. Given these partitions, the CM will
dynamically map partitions to clones.


Principle: Partitions enable division of the working 
set for higher throughput. 


As in LARD [PAB+98], partitions provide better locality 
and cache performance as the working set is partitioned 
across the clone group. Without partitions, the CM 
spreads load evenly, which effectively replicates the 
working set at each clone. 
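
The URL-based partition definitions described in this section (prefix, suffix and substring matching) amount to a small rule table. The sketch below uses hypothetical names (Rule, classify) and made-up patterns; first match wins, which is our choice and not something the text specifies.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Rule:
    kind: str       # "prefix", "suffix" or "substring"
    pattern: str
    partition: str


def classify(path: str, rules: Sequence[Rule], default: str = "default") -> str:
    """Return the partition of the first rule that matches the URL path."""
    for rule in rules:
        if rule.kind == "prefix" and path.startswith(rule.pattern):
            return rule.partition
        if rule.kind == "suffix" and path.endswith(rule.pattern):
            return rule.partition
        if rule.kind == "substring" and rule.pattern in path:
            return rule.partition
    return default


rules = [
    Rule("prefix", "/trade/", "premium"),
    Rule("suffix", ".jpg", "images"),
    Rule("substring", "chat", "chat"),
]
```

Requests that match no rule fall into a single default partition, which corresponds to the one-class behaviour a service gets when it defines no partitions.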

Partitions are also the unit of priority which helps 
with tiered quality of service and graceful degradation:


Principle: Partitions enable tiered quality of ser- 
vice. 

First, the CM enables differential quality of service by
allocating varying resources to different partitions: par-
titions need not map evenly onto clones. Figure 2 shows
this proactive form of uneven load balancing. Partition 1
maps to one clone, while Partition 2 maps to two: the
latter has twice the asymptotic throughput and better
latency, particularly as Partition 1 reaches its overload
point (about time 20). The variance of the one-node par-
tition is higher as well, due to averaging effects. Note
that both graphs show graceful degradation, with
smooth asymptotes and linearly increasing latency.

Second, priorities enable more graceful degradation 
during overload. The CM implements an admission- 
control policy based on partition priorities: requests to 
low-priority partitions are dropped first. Individual 
nodes can detect overload directly (Section 3.2) and 
notify the CM. In overload mode, the CM drops 
requests by routing them to a generic “drop” clone, anal- 
ogous to the use of /dev/null in UNIX. This is comple- 
mentary with the first strategy and they may be used 
together. In addition to admission control, nodes may 
take action themselves in overload mode to reduce the 
average work per request. We evaluate these mecha- 
nisms in Section 5.3. 

There is also a relationship between failures and
overload: when a clone fails, the remaining clones
typically receive increased load, which may put them into
overload mode. The movement of load due to failures
and the reaction to overload are independent
mechanisms, but both are automated and overload mode will
kick in only if needed.

Finally, it is important to realize that partitions do
not affect shared state or replication. In Figure 2, for
example, all three clones have the same program and
shared state, but the CM allocates traffic unevenly by
partition. This means that any clone can handle any
partition, although they do not in normal operation. If a
node fails, the remaining two nodes are given both
partitions automatically, which affects the difference in
quality but maintains high availability.

To summarize, the connection manager provides 
management of all external names, including dynamic 
mapping of names to physical nodes for load balancing 
and fault tolerance. It also implements policies based on 
partitions that allow a service to define relative quality 
of service and prioritized admission control. 


3.4 Shared State 

The fourth mechanism provides services with shared 
state: currently shared hash tables, B-trees, and file sys- 
tems. The shared state mechanisms need to support 
robust applications, and therefore must be scalable, 
highly available, durable and consistent. By implementing
these properties in the shared state mechanisms, we
essentially eliminate the burden of achieving them. In
particular, we hide all of the issues of atomicity,
replication, consistency and recovery from service authors.

Because high availability and consistency are
incompatible in the presence of network partitions [FB99], we opt
to use a redundant system-area network for communication
within our cluster (currently gigabit ethernet with
redundant switches). A partition-free network allows us
to use a two-phase commit protocol (2PC) [GR97] to
ensure consistency and atomicity of updates to shared
state across several nodes. The version of 2PC we use is
optimized for high availability in two ways. First, if a
member of the protocol dies in the second phase, the
2PC completes without it, because the replica will be
able to recover a consistent image of its state from its
peers later. Second, if the coordinator fails, we cannot
afford to wait until it recovers to complete the protocol:
instead, the replicas contact each other proactively after
a timeout and commit the action if any member received
a commit; otherwise, they all abort. The protocol is also
available to application writers to extend the framework
with additional shared state mechanisms.
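The coordinator-failure rule stated above (commit if any member received a commit, otherwise abort) is small enough to write down directly. The sketch below shows only that decision; the names ReplicaState and resolve_after_timeout are hypothetical, and the message exchange, timeout handling and recovery are omitted.

```python
from enum import Enum


class ReplicaState(Enum):
    PREPARED = "prepared"     # voted yes in phase one, outcome still unknown
    COMMITTED = "committed"   # received the coordinator's commit before it failed


def resolve_after_timeout(states):
    """Decide the outcome once the replicas have exchanged their states.

    `states` maps a replica name to its ReplicaState. The action commits
    if any replica saw a commit; otherwise every replica aborts.
    """
    if any(s is ReplicaState.COMMITTED for s in states.values()):
        return "commit"
    return "abort"
```

For example, if replica "a" is still prepared and replica "b" already saw the commit, both end up committing; if neither saw a commit, both abort.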

In the next two sections, we examine the cluster hash
table and file system in more detail. The cluster B-tree is
ongoing work.


3.4.1 Cluster Hash Table (CHT) 

Our prototypical shared data structure is the cluster
hash table, which uses the traditional interface of three
operations: get, put, and remove. Each operation is
atomic with strong consistency (equivalent to having a
single copy). The underlying data is partitioned across 
the cluster for scalability, and replicated for high avail- 
ability (see [GBH+00] for more details). The degree of 
replication can be varied based on the requirements of 
the application, and different tables in the same service 
may use different replication strategies. This control 
enables tradeoffs among performance, fault tolerance 
and storage requirements, and also enables the composi- 
tion of modules without name or policy collisions. 


3.4.1.1 Non-Blocking Synchronization 

The atomic put operation on the CHT returns the old
value prior to the update, in essence implementing an
atomic swap. Atomic swap can be used to implement
various synchronization primitives, such as locks (using
test-and-set) or read-modify-write (swapping in a
“locked” value first and then the updated value). Such
implementations, however, can be classified as blocking
[Her91], in that a process holding a lock may take an
arbitrary time to complete. This reduces both scalability,
due to lock contention, and availability, since a process
may die while holding the lock.

To overcome these obstacles, we extended the hash 
table interface with an apply operation, which imple- 
ments an atomic read-modify-write: 


apply(key, update_function) {
    temp = get(key)
    put(key, update_function(temp))
    return temp
}
The apply operation is implemented by shipping the
name of the update function to the nodes that store the
data and executing it there, analogous to function
shipping in databases. We can use the name rather than the
code, because of the “single program” facet of the
SPMC model. Atomicity is ensured, as before, by the
2PC protocol. Read-modify-write is sufficiently general
to implement a wide range of atomic primitives, such as
compare-and-swap, fetch-and-add, etc. Several of these
primitives, such as compare-and-swap, are universal
[Her91], and thus can be used to build non-blocking and
wait-free implementations of a data structure from a
sequential one.

Figure 3: Optimizations for Update Contention

This graph plots single-node throughput under heavy
contention in the CHT (without replication). Atomic
Swap drops off due to long lock-holding times, while
Weak Apply performs twice as well as apply, by not
holding locks across the 2PC round trip.
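To make the apply operation and the primitives built from it concrete, here is a single-process sketch. It is not the CHT: a local lock stands in for the atomicity that 2PC provides across nodes, the class name LocalTable and the register method are hypothetical, and passing extra arguments to the update function is an assumption added for convenience. Update functions are looked up by name, mirroring the function shipping described above.

```python
import threading


class LocalTable:
    """Single-process stand-in for the CHT; a lock replaces 2PC atomicity."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()
        self._functions = {}  # name -> update function known to every node

    def register(self, name, fn):
        self._functions[name] = fn

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def put(self, key, value):
        """Store value and return the previous one (atomic swap)."""
        with self._lock:
            old = self._data.get(key)
            self._data[key] = value
            return old

    def apply(self, key, fn_name, *args):
        """Atomic read-modify-write; returns the value before the update."""
        fn = self._functions[fn_name]
        with self._lock:
            old = self._data.get(key)
            self._data[key] = fn(old, *args)
            return old


def fetch_and_add(old, delta):
    # A missing key is treated as zero.
    return (old or 0) + delta


def compare_and_swap(old, expected, new):
    # Install `new` only if the current value equals `expected`.
    return new if old == expected else old
```

A caller registers the two functions once and then invokes, for example, table.apply("hits", "fetch_and_add", 1). Because apply returns the prior value, a compare-and-swap caller can tell whether its swap took effect by comparing the result with the expected value.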

However, we can build non-blocking data structures
directly: unlike conventional shared-memory systems,
each location in a hash table stores an entire object, as
opposed to just a pointer or a primitive value. This
allows us to provide the update function from a
sequential implementation as the argument to the apply
function. The update is atomic and non-blocking.³
Therefore, the operation is naturally wait-free, without
the complexity or overhead usually associated with
wait-free protocols.

The improved scalability of apply-based updates can
be seen in Figure 3. We compare an update implemented
using two atomic swaps to one implemented by
an apply operation; the graph shows the aggregate
throughput of several clients continuously updating the
same data value. The performance of the atomic swap
implementation quickly degrades as concurrency increases,
since more time is spent trying to obtain the lock. The
apply-based implementation performs better at the outset,
since it requires half as many operations to complete
an update, and the aggregate throughput remains virtually
flat as we scale up to 16 nodes.

3: This is not strictly true (nor could it be) for an arbitrary update
function; however, we assume simple non-blocking update functions,
which holds in actual use. Our most complex update function appends
to a list represented as an array and sometimes has to resize the array.

The graph also shows performance of a “weak” ver- 
sion of the apply operation; this version has exactly the 
same interface, but weaker consistency semantics. 
Namely, it commits the update in the first phase of the 
2PC; the second phase is only to let replicas know that 
everyone made the update (in case the coordinator fails). 
The update is eventually executed atomically on each 
node; however, cluster-wide atomicity is not achieved. 
In particular, updates may be executed in different 
orders at different nodes. These relaxed semantics allow 
for a significant performance improvement, since locks 
are held only for the duration of the local update and not
across the round-trip interaction with a coordinator. 

An example data structure that takes advantage of 
these weaker semantics is an unordered list. Insert and 
remove operations are commutative in unordered lists, 
so “weak apply” semantics are sufficient. Such lists are 
used by several of our applications. 
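The claim that unordered lists tolerate reordered updates can be checked with a small experiment. The sketch below replays the same set of updates in every possible order, as different replicas might under weak apply, and compares the results as unordered collections. The helper names are hypothetical. The example assumes that each removed item is already present and that no insert and remove of the same item race each other, since in that case the order would matter.

```python
from itertools import permutations


def insert(item):
    """Update function that appends item to the list."""
    return lambda lst: lst + [item]


def remove(item):
    """Update function that deletes one occurrence of item, if present."""
    def update(lst):
        out = list(lst)
        if item in out:
            out.remove(item)
        return out
    return update


def replay(initial, updates):
    """Apply the updates in the given order, as one replica would."""
    state = list(initial)
    for update in updates:
        state = update(state)
    return state


def same_unordered(a, b):
    """Compare two lists while ignoring element order."""
    return sorted(a) == sorted(b)


INITIAL = ["a", "b"]
UPDATES = [insert("c"), remove("a"), insert("d")]

# One result per possible delivery order of the three updates.
RESULTS = [replay(INITIAL, order) for order in permutations(UPDATES)]
```

Every entry of RESULTS holds the same items, although the positions of the items differ between orders; that positional difference is exactly what an unordered list is allowed to ignore.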


3.4.1.2 Replication and Performance 

As mentioned above, the CHT replicates data for 
high availability. The implementation distinguishes 
storage clones from the libraries that clones include to 
use the CHT, which decouples clone group size from 
which nodes actually store data. The set of storage 
nodes remains stable except for faults and explicit oper- 
ator-controlled repartitioning. Thus, the replica groups 
and recovery are managed entirely within the CHT 
implementation, and storage clones are shared by many 
tables and clone groups (with namespace isolation). 
Replication and durability policies are table-specific, but
the storage clones are not. 

The degree of replication can have an impact on
performance: more replicas will deliver higher read
throughput, but lower throughput for updates. An 
extreme case of the latter effect can be observed when 
many updates to a single location in the hash table are 
attempted simultaneously: different nodes may prepare 
successfully for different instances of the 2PC protocol, 
causing all instances to fail. The chances of livelock 
increase with the number of conflicting updates and the 
degree of replication. 

Most data updates are largely independent, so such
conflicts do not happen frequently, but when they do
occur, their impact is significant. We were forced to add
an algorithm that detects livelock and serializes prepare
interactions with each replica. As Figure 4 illustrates,
such detection improves the performance of atomic apply
to a tolerable level, but there is still significant
degradation. If this is unsatisfactory, the application
designer has the choice of reducing the amount of
replication or relaxing consistency requirements by using
“weak” apply, which does not experience livelock. Another
possibility, not yet implemented, would be to use
exponential backoff. For comparison, we also measured the
performance of atomic-swap-based updates; we found that it
becomes unusable under a moderate amount of contention,
despite livelock detection.

Figure 4: Update Contention with 2PC (2 replicas)

This graph reveals the impact of livelock on updates
using 2PC. Atomic Swap encounters livelock with
even two competing updates. Proactive livelock
detection is significant for Apply.


3.4.2 Cluster File System (CFS) 

The second form of shared state is the cluster file 
system (CFS). In contrast to the CHT, the cluster file 
system manages large blobs of persistent storage that 
are normally on disk, and supports the streaming of data 
directly from disks to clients. Although it is closely 
related to a traditional file system, we chose not to 
implement the normal UNIX file system interface for 
several reasons: 


• The traditional API limits atomicity. First, the only
atomic operation is “rename”, which requires
copying whole files even for small updates. Second,
file metadata operations are path based, which
mixes path and file updates, and presents problems
if the path changes during a file update. We provide
first-class i-nodes, which eliminate redundant path
resolutions, and provide natural support for atomic
file updates, since you can name them directly.


• File consistency across multiple nodes is very
limited. We desire a range of consistency and
durability options, including both strong consistency
across the cluster with replication, and local
temporary storage.


• There is only one kind of index on files, the
directory. We would like files to belong to multiple
indices of different types simultaneously, including
hash tables, B-trees, and version trees.


• We would like extensible metadata to simplify
service-specific file operations, such as version
numbers, TCP or MD5 checksums, and caching/
expiration directives.

Thus, the basic strategy is to provide a “toolbox”
local file system that we can reuse in multiple ways to
build service-specific cluster-based file systems. The
toolbox deals with entirely local instances of storage,
called volumes. We provide a simple physical global
namespace using (node, volume, i-node) triplets.

A “file” consists of an i-node that has several values:
typically a segment, which holds the data, and some 
metadata attributes. The metadata is extensible, which 
allows services to store their own metadata. A “direc- 
tory” is just one kind of index on top of the i-nodes. 

We provide atomic transactions, which in turn
enable files to have multiple indices (or multiple
parents), and simplify renaming, deletion, and path
operations. Direct exposure of i-nodes also allows
database-style iteration through sets of files.

The real power of the CFS comes from the ease with
which an author can create file-system-like things. To
build a shared file system across a clone group, the
author need only define a global namespace. For
example, storage for web pages need not have a directory at
all, and can just use the CHT to store name-to-i-node
mappings, thus enabling single-seek access to the data
segment. Or even simpler, a CFS with a fixed number of
nodes can be built just by using a static hash function to
map file names to nodes. Thus with little service-level
code, we can achieve a variety of file systems.
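Both namespace options mentioned above fit in a few lines. The sketch below is a hedged illustration, not the CFS API: node_for is a static hash from file name to node index (the choice of SHA-1 is an assumption), and NameIndex is a dictionary standing in for a CHT table that stores name-to-(node, volume, i-node) mappings. The i-node allocation scheme is invented for the example.

```python
import hashlib
from collections import namedtuple

# Physical name of a file, following the (node, volume, i-node) triplet in the text.
Locator = namedtuple("Locator", ["node", "volume", "inode"])


def node_for(file_name, num_nodes):
    """Static hash from file name to node index, stable across processes."""
    digest = hashlib.sha1(file_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


class NameIndex:
    """Stand-in for a CHT table holding name -> Locator mappings."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self._table = {}
        self._next_inode = {}  # (node, volume) -> next free i-node number

    def create(self, file_name, volume=0):
        node = node_for(file_name, self.num_nodes)
        inode = self._next_inode.get((node, volume), 0)
        self._next_inode[(node, volume)] = inode + 1
        locator = Locator(node, volume, inode)
        self._table[file_name] = locator
        return locator

    def lookup(self, file_name):
        """One table lookup yields the i-node directly; no directory walk."""
        return self._table.get(file_name)
```

With a fixed number of nodes, node_for alone is enough to place files; the NameIndex variant adds the direct name-to-i-node lookup that avoids path resolution.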

Replication is completely orthogonal to the
partitioning of a cluster-wide namespace and is handled
quite differently than in the CHT. For a replicated CFS,
the author must define the replica groups and use 2PC to
update them. Since operations in the file system are
atomic, the general 2PC manager can be used to build
replication easily. This is intentionally quite a different
policy from the CHT, in which replication was managed
transparently. We take a different tack in CFS for two
reasons: 1) there is a wider array of strategies for a
replicated file system, making it harder to have any single
one, and 2) the existence of the CHT makes it really
easy to manage replica groups within a service, since it
handles atomicity and recovery of this metadata
automatically. This approach enables powerful
service-specific CFSs (a service can have more than one) with very
little application-level code. We have built three
different service-specific file systems so far.

Finally, as a performance optimization, we support 
streaming of data segments across the cluster (versus 
store-and-forward copying). Any stage in the cluster 
may issue a stream task (acting as stream client) to 
another stage (the stream server), thus establishing a vir- 
tual channel within the cluster for reading or writing 
data. This is particularly useful for streaming data 
directly from the CFS out to the wide-area network, and 
is used for both our web and e-mail servers.

Figure 5: Ninja Web Service

Only Front Ends receive external connections.
Cache nodes serve files locally or retrieve them from
either the hash table or CFS (two different versions).





We have not built a generic cluster file system yet, 
although the one in our e-mail server is relatively
general and could be packaged up for reuse by other
services. We are still learning about the revised CFS API
and expect to generalize support for replication and par- 
titioning in the future, which will eliminate the small 
amount of service-level code required now. 


4 Applications 


In this section, we review three applications built 
using the Ninja framework. We then use these applica- 
tions in the next section to evaluate the framework and 
our goals of robustness and ease of authoring. We have 
also built several other applications, including other web 
and mail servers, and a Napster-like file-sharing service. 


4.1 Ninja Web Server 

The prototypical service for Ninja is the web server,
and we thus use it to evaluate all of our goals. The web
server is relatively simple but achieves scalability, high
availability, graceful degradation, and online evolution.

We have implemented several web server proto- 
types, serving both static and dynamic pages using 
either the hash table or the CFS for page storage. Our 
latest prototype builds upon the Haboob web server 
[WCB01] and modifies it to retrieve pages from the CHT.
Haboob uses SEDA to handle a large number of simul- 
taneous connections, making it an ideal front-end for the 
Ninja cluster web server. Haboob maintains an in-mem- 
ory cache; we performed minor modifications to the 
cache miss component to fetch page data from the hash 
table instead of from local disk. Adding a thin wrapper 
to make an instance of Haboob behave as a Ninja clone 
allows us to create a clustered web service, with the CM 
directing external HTTP requests to one of the clones. 
Figure 5 shows the structure of the web server. 

The shared persistent state maintained by the CHT
allows any front-end node to answer any request; the
CM masks front-end failures. The replication policy
used for the tables storing page data can be tuned to
achieve desired tradeoffs between availability and
performance, and different classes of pages may be split
among tables with different replication strategies.
Similarly, front ends may be partitioned to provide differing
quality of service or achieve better cache performance,
as described in Section 3.3.2.

Figure 6: NinjaIM Architecture

Each protocol has a clone group, whose size
depends on the traffic in that protocol. All protocols
use Message Routers to communicate with each
other, and all use the CHT for profiles and buddy lists.


4.2 Universal Instant Messaging Proxy 

NinjaIM is an instant-messaging (IM) proxy that
performs protocol translation among popular instant
messaging protocols and e-mail. It currently supports
AIM, ICQ, MSN, Yahoo!, and IMPP protocols. Users
can use the unmodified MSN client software or our Java
applet-based client to communicate with users on all
five IM systems. NinjaIM forwards messages
bidirectionally among the five systems, which allows all users
to reach each other.

There are several challenges in implementing an IM 
service. First, it must scale to a huge number of connec- 
tions that are mostly idle; AOL’s AIM has over 90M 
registered users [Hu00]. Most IM systems use long-lived 
TCP connections for every active user. Second, it 
requires scalable persistent storage for user profiles and 
buddy lists. Third, it must be able to route messages 
efficiently and process buddy status updates. 

Figure 6 shows the NinjaIM architecture. The
Connection Manager enables NinjaIM to easily scale up the
number of connections linearly with nodes, and provide
high availability. The CHT is used to store user profiles
and buddy lists, which allows users to connect to any
node in the cluster. To provide efficient buddy status
notification, both a forward buddy list and a reverse
buddy list are stored. In addition, we store the node to
which a user is connected, which is used to route
messages between nodes. All of the shared state may be
cached locally (local soft state) to improve performance.
Finally, partitions are used to provide better affinity for
chat sessions. For example, when a user initiates a chat
session, all the parties are given the same partition
number (externalized as a port number) to which to connect.




The CM maps the port number to a single node in the 
normal case, but need not in the presence of unusual
load or faults. 


4.3 NinjaMail 

E-mail is one of the most widely used Internet appli- 
cations, with hundreds of millions of users world-wide. 
Moreover, many of these users are concentrated in large 
e-mail services. AOL currently has over 23 million
e-mail accounts [Lci00], while Hotmail has over 110
million [WB01].

NinjaMail is a scalable, highly available and extensi- 
ble e-mail service, built on the Ninja architecture. At 
NinjaMail’s core is the MailStore module, a message 
access library that uses the CHT and CFS to store user 
profiles, e-mail messages, and message indices. Built on 
top of this are various access modules, which support 
interaction between users and the message store. We 
have fully functional modules for sending and receiving 
messages via SMTP, and reading messages via POP and 
HTML. Additionally, we have nearly completed an 
implementation of the IMAP protocol. 

Figure 7 shows the architecture for NinjaMail. The 
NinjaMail modules keep all long-lived state in the CHT 
or the CFS. This allows the infrastructure to create and 
destroy clones in response to load changes or faults. 
NinjaMail’s use of the underlying mechanisms is
illuminated by examining a typical message cycle:


Message arrival (SMTP): 1) Accept a new SMTP
connection from the CM, 2) check the CHT to
verify that the recipient is a valid user, 3) stream the
message to the replicated file system (MailStore),
and 4) use the CHT apply function to add the
message to the user's message index.


Message retrieval (POP): 1) Accept a new POP
connection from the CM, 2) check the user’s login
name and password in the CHT, 3) retrieve the
user’s message index, 4) stream messages from the
MailStore to the user, as requested, and 5) use the
CHT apply function to update the persistent copy of
the message index when the user deletes messages
or updates the status flags.
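The two message cycles above can be traced with an in-memory model. This is only a sketch of the control flow: the class MiniMail and its methods are hypothetical, plain dictionaries stand in for the CHT and the MailStore, connection handling (step 1) is omitted, and the model is neither replicated nor thread-safe.

```python
def append_message(index, message_id):
    """Update function: return a new index with message_id appended."""
    return (index or []) + [message_id]


class MiniMail:
    def __init__(self, users):
        self.profiles = dict(users)  # stands in for CHT profiles: name -> password
        self.indices = {}            # stands in for CHT message indices
        self.store = {}              # stands in for the MailStore
        self._next_id = 0

    def _apply(self, key, update):
        """Read-modify-write on a message index; returns the old index."""
        old = self.indices.get(key)
        self.indices[key] = update(old)
        return old

    def deliver(self, recipient, body):
        """Message arrival, steps 2 to 4."""
        if recipient not in self.profiles:   # 2) verify the recipient
            return None
        message_id = self._next_id
        self._next_id += 1
        self.store[message_id] = body        # 3) write the message body
        self._apply(recipient, lambda idx: append_message(idx, message_id))  # 4)
        return message_id

    def retrieve_and_delete(self, user, password):
        """Message retrieval, steps 2 to 5, deleting everything that was read."""
        if self.profiles.get(user) != password:                 # 2) check login
            raise PermissionError("bad login")
        ids = list(self.indices.get(user, []))                  # 3) read the index
        bodies = [self.store[i] for i in ids]                   # 4) read messages
        self._apply(user, lambda idx: [])                       # 5) update the index
        return bodies
```

The point of the sketch is that every piece of long-lived state lives in the shared tables, so the object handling a connection can be discarded after each call.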


The cluster-based file system of the MailStore is 
built using the CFS to provide atomic local storage vol- 
umes, and the CHT maintains the mappings from parti- 
tions to replica groups, and from replica groups to 
MailStore clones. It uses the 2PC library to update the 
replicas. This gives us a replicated cluster file system 
with very little application code. There are no
“directories” in the file system; the only index on files is the
global hash table that maps replica groups to storage nodes.

Figure 7: NinjaMail E-Mail Service

The service is divided into protocol handlers, each of
which has its own clone group, and the MailStore,
which uses both the CHT and CFS to manage e-mail,
user and folder information.


5 Evaluation


In this section, we use the above applications to 
evaluate each of our goals: scalability, high availability, 
graceful degradation, online evolution, range of seman- 
tics, and ease of service authoring. 


5.1 Scalability 

For scalability, the overall goal is to support a very 
large number of users. For most services this corre- 
sponds directly to the number of simultaneous connec- 
tions, including the web server, IM server and music 
server. For NinjaMail, scalability is tied more directly to 
messages per second. Ninja also supports linear scaling 
of database size, which comes directly from simple par- 
titioning; we have built services using the CHT with 
more than 1 TB of storage and over 100 nodes [GBH+00].

Figure 8 shows the scalability of NinjaMail in a
message receipt, storage, and retrieval test. Each cluster
node functions both as a front-end for SMTP and POP,
and as a member of the CHT. The cluster nodes used for
this experiment were 2-way SMPs with 500-MHz
processors and 512 MB of RAM, running Linux 2.4.7.
Each test was performed with a user base of 1 million
times the cluster size. Our test harness executes a simple
loop. It first selects a random user and node, and
submits a 4K message. Next, it selects another random user
and node, and uses POP to read and delete all messages
for that user. The per-node performance is excellent, at
about 14 times the performance of a typical sendmail
setup on the same hardware (for the message receipt
portion), perhaps due to the efficiency of the CHT for
metadata. Extrapolating, we expect NinjaMail (as is)
should be able to handle the Yahoo! mail workload,
about 12 billion messages per month [Yah01], with a
cluster of around 100 nodes.

Figure 9: NinjaIM Scalability

Scalability is measured in terms of total throughput of
IM messages per second. In addition to the n front-end
nodes there are 2 additional nodes that store the
replicated CHTs for NinjaIM.

Figure 9 shows the scalability for NinjaIM,
measured in total throughput of IM messages. Simulated
clients saturate the server using the full MSN IM protocol
by sending messages every 5 seconds. Each front end
node ran one message router and one MSN IM server.
At the peak of 4941 messages/sec (with 8 front ends),
this corresponds to almost 25,000 simultaneous
extremely active clients.
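The two extrapolations in this section reduce to short arithmetic, reproduced below as a check. The 30-day month and the use of an average (rather than peak) rate are assumptions of this calculation; the variable names are ours.

```python
# Mail: 12 billion messages per month spread over about 100 nodes.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60           # assumes a 30-day month
yahoo_msgs_per_sec = 12e9 / SECONDS_PER_MONTH   # average cluster-wide rate
per_node_msgs_per_sec = yahoo_msgs_per_sec / 100

# IM: each simulated client sends one message every 5 seconds, so the
# number of clients equals the peak message rate times the send interval.
im_clients = 4941 * 5

print(f"mail, cluster-wide: {yahoo_msgs_per_sec:.0f} messages/sec")
print(f"mail, per node:     {per_node_msgs_per_sec:.1f} messages/sec")
print(f"IM clients at peak: {im_clients}")
```

The mail workload averages roughly 4,630 messages per second, or about 46 per node on a 100-node cluster, and the IM peak corresponds to 24,705 clients, which matches the “almost 25,000” quoted above. Real traffic is bursty, so the average mail rate understates what a deployed cluster would need to absorb at peak.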

Figure 10 shows the scalability of the web server
under the SPECweb99 benchmark with 600MB data/node;
single node performance is consistent with a solid
single-node web server such as Apache [Apa01]. Note
that the Ninja web server is not just a web farm, but
actually reflects strongly consistent data across the
cluster, and integrated connection management for high
availability and online evolution.

Figure 11: Recovery Time for an Unexpected Death

This graph shows the recovery process for a two-node
web server with one failure at time 20. A new
clone takes over the traffic in six seconds, and
recovery completes in about 20 seconds.


5.2 High Availability 

Our goal for high availability is to show that users 
have a high probability of success, and that we can pro- 
vide independent retry for a given query or connection. 
It is not our goal to provide fault-tolerant connections. 
Rather, we forfeit active connections on lost nodes, but 
retries should automatically locate another node and 
work correctly (by using shared replicated state).

Figure 11 shows the recovery after an unexpected 
death in a two-node web server. A third clone took over 
the affected traffic within 6 seconds, and the overall
server was fully recovered in about 20 seconds. 
Response time increased by several seconds during the 
recovery process, and active connections on the dead 
node were lost. 

In the case of graceful shutdown, Ninja normally 
does not drop any connections. This case is covered 
shortly under the discussion of online evolution, which
uses controlled shutdowns to upgrade a running service
without drops. 


5.3 Graceful Degradation 

At the service level, we support several strategies for 
graceful degradation. The goal is to react gracefully to 
offered loads that exceed capacity. In practice, peak
loads can be 5x the average load, making it impractical
to provision for peak load [Mov99]. Even with
overprovisioning, 10x load spikes still occur [WS00].

The default strategy is simply to reject new
connections when the service is saturated. This preserves the
maximum throughput, but is not all that graceful. We
provide three strategies that exploit service-level
knowledge, via partitions, to degrade more gracefully.

The first strategy is simply to prioritize partitions
and assign separate resources for each partition. This
enables low-priority partitions to be overloaded without
affecting high-priority connections, and enables
independent throughput asymptotes and overprovisioning
ratios. This was shown in Figure 2, in the CM section.

Figure 12: Prioritized Admission Control (Overload)

We start with a 2-node clone group with two
partitions, 1 and 2; Partition 1 has higher priority.
Initially both clones handle both partitions. Overload
is detected by one of the nodes, which initiates
overload mode. At “1st drop” the CM drops half the
traffic to the lower priority partition; after a second
drop, Partition 2 is not admitted at all and Partition 1
throughput doubles.

The second strategy, shown in Figure 12, is to
prioritize partitions and drop low-priority connections during
overload. We refer to this as prioritized admission
control. This ensures that under overload the most
important connections (or users) are handled first, potentially
to the exclusion of lower priority connections. An
improvement would be to drop connections
probabilistically based on the priority of the partition, but this can
essentially be done by first using different resources for
each partition, and then dropping connections
independently as each partition becomes overloaded.

As discussed in Section 3.3.2, Ninja sheds excess 
connections by sending them to a “drop” clone. We 
have not fully explored the power of this mechanism, 
which could be service specific to enable very fine-grain 
admission control, since the drop clone could decide on 
a case-by-case basis to handle some connections. 
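Prioritized admission control with a drop clone can be sketched as a routing function. The sketch follows the two-step behavior described for Figure 12 (half of the low-priority traffic dropped first, then all of it), but the function name route, the numeric overload levels, the alternate-request dropping rule, and the round-robin clone choice are assumptions for illustration rather than the CM's actual policy.

```python
DROP = "drop"  # label for the generic drop clone (hypothetical name)


def route(partition, seq, overload_level, low_priority, clones):
    """Return the clone that should receive request number `seq`.

    overload_level 0: normal operation, everything is admitted.
    overload_level 1: every other low-priority request goes to the drop clone.
    overload_level 2: all low-priority requests go to the drop clone.
    High-priority partitions are always admitted.
    """
    if partition in low_priority:
        if overload_level >= 2:
            return DROP
        if overload_level == 1 and seq % 2 == 1:
            return DROP
    # Placeholder for the CM's load balancing across real clones.
    return clones[seq % len(clones)]


def dropped(partition, overload_level, n=100):
    """Count how many of n requests to a partition reach the drop clone."""
    clones = ["clone0", "clone1"]
    low = {2}  # Partition 2 has lower priority, as in Figure 12
    return sum(
        route(partition, seq, overload_level, low, clones) == DROP
        for seq in range(n)
    )
```

Under these assumptions, 100 requests to Partition 2 lose 0, 50 and 100 requests at overload levels 0, 1 and 2, while Partition 1 never loses any. A service-specific drop clone could replace the DROP label with logic that still serves selected connections.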

The third strategy we employ for graceful
degradation is to try to serve more requests, but in a degraded
form, which is possible because clones know that they
are in overload mode. For example, a web server might
serve generic versions of pages rather than personalized
versions. The degraded service moves out the absolute
scale limit, at which point further degradation using one
of the first two strategies would have to take place.


5.4 Online Evolution 

Online evolution is enabled by our ability to shut
down clones gracefully, without dropped connections.
Figure 13 shows online evolution between two versions
of a three-node web server. To upgrade a node, the
infrastructure first updates the CM to stop sending
new connections to the Version 1 clone. Next, Ninja
starts a Version 2 clone on the node, which begins
receiving new connections. The Version 1 clone exits
once all established connections have been serviced. By
repeating this process on all nodes in sequence, we can
upgrade the entire cluster with no downtime and no
dropped connections.

Figure 13: Online Evolution for a 3-node Web Server

Each node starts with Version 1 and is then
gracefully shut down and restarted with Version 2. No
connections are dropped during the transition. The
first restart takes longer due to creation of the clone
group for Version 2.

Note that the two versions typically coexist on the 
same node for some time while the Version 1 clone fin- 
ishes servicing existing connections. This is possible 
because Ninja's virtualization of resources prevents the 
two versions from interfering with each other (other 
than performance). 
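The per-node upgrade sequence can be modeled as a small state change. The sketch below is a toy model under stated assumptions: the Node class, its fields and the upgrade function are hypothetical, connections are represented only as counters, and the CM update and the start of the Version 2 clone are collapsed into adjacent steps.

```python
class Node:
    def __init__(self, name, v1_connections):
        self.name = name
        self.active = {"v1": v1_connections}  # version -> open connections
        self.accepting = "v1"                 # version receiving new connections

    def connect(self):
        """A new external connection arrives at this node."""
        self.active[self.accepting] += 1

    def finish_one(self, version):
        """One established connection completes normally."""
        self.active[version] -= 1


def upgrade(node, arrivals_during_drain=0):
    """Move a node from Version 1 to Version 2 without dropping connections."""
    node.active["v2"] = 0          # the Version 2 clone starts on the node
    node.accepting = "v2"          # the CM stops routing to Version 1
    for _ in range(arrivals_during_drain):
        node.connect()             # new connections land on Version 2
    while node.active["v1"] > 0:   # both versions coexist while Version 1 drains
        node.finish_one("v1")
    del node.active["v1"]          # Version 1 exits once it is idle


def rolling_upgrade(nodes):
    """Upgrade the nodes one at a time, in sequence."""
    for node in nodes:
        upgrade(node)
```

In this model no connection is ever discarded: established Version 1 connections run to completion, and connections that arrive during the drain are served by Version 2.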


5.5 Range of Semantics 

Our support for a range of data consistency seman- 
tics comes primarily from the CHT and from the ability 
to build service-specific file systems. The simplest form 
of this is choosing non-replicated storage, which is
possible with both the CHT and the file system toolkit. We
have found this useful for local file caches in some of 
our web server implementations and in the Ninja ver- 
sion of Napster (not discussed). 

Figures 3 and 4 show that we can reduce lock con- 
tention if we accept updates that have an inconsistent 
ordering across replicas (using “weak apply”). This
approach can achieve five times the throughput with 
two replicas under heavy contention. Our primary use 
for this so far has been to maintain unordered lists, 
which are useful in NinjaMail (since internal message 
order is not critical), and in various membership lists, 
such as members of a chat room. 
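
A small sketch shows why unordered lists tolerate this relaxation. The
weak_apply function below is a hypothetical stand-in, not the Ninja
function of that name: it applies updates in whatever order they reach a
replica, so two replicas can disagree on order while agreeing on content.

```python
def weak_apply(replica, updates):
    """Apply updates in arrival order, with no cross-replica ordering."""
    for member in updates:
        replica.append(member)


updates = ["alice", "bob", "carol"]
replica_a, replica_b = [], []

# The two replicas receive the same updates in different orders.
weak_apply(replica_a, updates)
weak_apply(replica_b, list(reversed(updates)))

same_order = replica_a == replica_b                     # ordering differs
same_members = sorted(replica_a) == sorted(replica_b)   # contents agree
```

For a chat-room membership list only same_members matters, which is the
property the text relies on.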

Although not discussed, we have also exploited
delayed disk writes in the CFS to improve latency and
throughput at the expense of a small window for lost
data (much like NFS updates). Similarly, the CHT
allows two independent memory copies to be considered
“durable”, rather than the stricter definition of
two copies on disk. In the former case, the disks are
updated shortly thereafter in the style of group commit.

Figure 14: Code Size for Services in Lines (Java)


5.6 Service Authoring 

Overall, we found it hard to write the underlying 
mechanisms, but easy to write the services, which ful- 
fills our primary goal. Figure 14 shows the code size for 
four applications. The Ninku application, which was not 
discussed, is the Ninja version of the Napster service. In 
comparison, the Ninja infrastructure code is about 
20,000 lines of code, not counting various third-party 
libraries used by the CHT and CFS. These services are 
remarkably small given that they are full-fledged robust 
services. For example, the Porcupine scalable e-mail 
server [SBL99] is about 30,000 lines by itself. Both the e- 
mail and web servers seem to be about one-tenth the 
size of comparable stand-alone applications. 

The primary burden that we did not lighten is the
difficulty of authoring protocol code, which presents an 
obvious place for future work. We also found event-based
programming, used for most internal modules and
some services, to be harder than using threads.

One other important point is that none of these applications
require any code for high availability, online evolution
or graceful degradation, although some may have
a few configuration lines to define partitions (if used).


6 Discussion 


In this section, we review the principles behind 
Ninja and discuss related and future work. 

The programming model has three important princi- 
ples. First, we want to exploit the natural parallelism of 
Internet services. The two advantages of this approach 
are that applications fit naturally and that we can ban 
more general types of concurrency, which are historically
hard to get right and require unknown resources for
correctness. In particular, services do not create or
manage threads.

Second, like databases, we want to hide the complexity
of fault tolerance, persistence and scalability.

Third, the explicit use of shared state allows us to 
simplify recovery greatly. The invariant is that anything 
that must survive faults must be kept in the shared state. 
Local state thus requires no recovery, and the shared 
state is recovered transparently. We apply the same
principle recursively to build the shared state mechanisms:
we differentiate storage clones, which require
recovery, from all other clones, which do not. It is the
clean recovery story that allows us to provide high 
availability, online evolution, and dynamic resizing and 
remapping of clone groups. 

We have also pursued a bottom-up approach that 
provides a range of semantics. For example, the 2PC 
library and CFS are really tools that are used to simplify 
building complex systems. The primary difference 
between the CFS and a traditional file system is exactly
that the CFS is a toolbox with a more appropriate interface
for authoring services that need replicated persistent
storage. Even the CHT is used as a tool to build the
cluster-based file system of NinjaMail. Similarly, we try 
to make these tools configurable to enable tradeoffs 
among performance, consistency and replication. The
“weak apply” function is the best example of this.

Connection management seems fundamental to 
robust Internet services. In general, there must at least 
be a dynamic mapping between external names and currently
working internal nodes; otherwise failures are
visible to clients. Partitions enable application input into 
how the CM should prioritize connections. This is 
essentially a use of static type information (partitions 
are usually defined statically) to enable run-time optimi- 
zation during overload. They also provide better cache 
affinity for all kinds of “front end” clone groups. 

The CHT exploits the use of a narrow interface to
simplify the maintenance of consistency. Unlike a
shared address space, the CHT can only be accessed via
method calls and thus only needs to ensure consistency
at these points. This is most noticeable in the ability to
support atomic updates, as a hash table value is simply 
not visible during updates. 
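
The idea of enforcing consistency only at method boundaries can be
sketched with a single-process table. This is an illustration under
stated assumptions, not the CHT: NarrowTable and its get, put and update
methods are hypothetical names, and a local lock stands in for whatever
the cluster implementation uses.

```python
import threading


class NarrowTable:
    """Hash table reachable only through method calls."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get(self, key, default=None):
        with self._lock:
            return self._data.get(key, default)

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def update(self, key, fn, default=None):
        """Atomically replace the value with fn(old value)."""
        with self._lock:
            # No other caller can observe the value while it is being updated.
            new_value = fn(self._data.get(key, default))
            self._data[key] = new_value
            return new_value


table = NarrowTable()


def worker():
    for _ in range(1000):
        table.update("hits", lambda v: v + 1, default=0)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = table.get("hits")
```

Because callers never hold a reference into the table's storage, the
read-modify-write inside update cannot interleave with another caller's
access, so no increments are lost.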


6.1 Related Work 

Lightweight recoverable virtual memory [MMK+94] 
provides an integrated approach to in-memory data 
structures with durability. It could be used to implement 
the non-replicated versions of our shared data structures, 
but does not support replication or 2PC. 

The traditional way to simplify persistent applications
is to store all data in a DBMS and use a declarative
query language for all access and updates. We intentionally
desire a “navigational” rather than relational interface
for better integration with the rest of the service,
which is navigational. Databases also focus on consis- 
tency under faults at the expense in practice of availabil- 
ity, where we explicitly provide a range of semantics 
and tradeoffs. We find that availability is often more 
important for Internet services than strict consistency. 
DBMSs also provide a large whole solution with little 
ability to customize semantics or make tradeoffs. In 
contrast, we may decide to consider something commit- 
ted if it is in memory on two independent nodes, and 
only later move objects to disk, which increases 
throughput for updates. We believe services should use 
a combination of our techniques and DBMS solutions. 

Object-oriented databases, such as Thor [LAC+96] or 
Persistent Java [ADJ+96], share our use of controlled 
interfaces, and can implement all of our shared data 
structures, although they are more heavyweight and 
generally don’t offer a range of semantics. They are 
strictly more powerful, with support for transactions and 
nested objects. We also find power in our “toolbox” 
approach that has allowed us to build a range of persis- 
tent data structures out of logging, 2PC, and the apply 
function. We also depend on and exploit our partition- 
free, high-performance network (typical for a cluster). 

Application servers, such as BEA’s WebLogic 
[BEA01], provide persistent shared state by wrapping
navigational structures around a relational database. 
These servers also target large-scale highly available 
services and were developed concurrently. Application 
servers typically also provide integration with legacy 
systems. Ninja provides better support for in-memory 
data structures, variable semantics, graceful degradation 
and online evolution. Use of Ninja’s techniques would 
complement these servers’ use of RDBMS systems, and 
one vendor is incorporating some of our techniques.

“Layer 7” switches, such as the Foundry switch that 
we use, provide some aspects of connection management,
as does HACC [ZBCS99]. In particular, they can
provide load balancing and basic partitioning by URL. 
The primary advantage of Ninja is the integration of the 
control of the manager into the framework. We dynami- 
cally reconfigure the switches as clone groups change, 
and we provide integrated support for online evolution 
and graceful degradation. In fact, our dynamic use of
these switches was clearly novel, as we uncovered many 
new bugs in production hardware that we had to work 
around. 

The Porcupine mail server [SBL99] shares the goals 
of scalability and availability, and even some of the 
techniques for replication and scalability. However, Por- 
cupine is a single application rather than a framework. 
The existence of Ninja makes it easy to write Porcupine: 
NinjaMail has about one-tenth the code size of Porcupine
for similar functionality and robustness. In addition,
Ninja is more efficient and allows a wider range of
performance tradeoffs than were present in Porcupine. 

The TACC framework [FGB97] is a predecessor to
this work and shares most of our goals, but does not
address persistent shared state, which is the hardest part. 
It also uses application-specific front ends to do connec- 
tion management, which we avoid. 

The Ninja project [GWv+01] that led to this work has
a much broader scope, and includes support for distrib- 
uted systems built on top of clusters, which we refer to 
as “bases” in the overall architecture. Some of the additional
pieces include support for proxies and end
devices, such as laptops, phones, or PDAs; support for 
paths that connect these elements; security; and OS and 
proxy support for small devices. There are also papers 
that cover subsets of the work here in greater detail, 
including the CHT [GBH+00, Grb00] and SEDA [WCB01].


6.2 Future Work 

There are at least three key areas of future work: 
ease of authoring, ease of use, and support for shared 
state. Section 5.6 covered some enhancements to ease
authoring. 

To simplify ease of use, there is much we could do 
to automate online evolution and graceful degradation.
Evolution should have an explicit publishing process 
and a way to revert to the previous version easily. 
Graceful degradation is mostly automated, but is still 
very service specific. We don’t help much with how a 
service should define partitions or trade off quality and 
performance. We could also use a unified way to test 
overload conditions and in general administer running 
services. 

Our support for shared state should evolve to include 
true transactions rather than the atomic actions that we 
support now. This is quite a bit harder and the current 
set has proven very useful as is. There is also more we 
can do with the interaction between lock contention and 
2PC, as discussed in Section 3.4.1. Finally, our recovery 
code remains immature due to the difficulty of thorough 
testing. However, it is exactly the complexity of recov- 
ery code that makes it so valuable to build once for the 
framework, rather than separately for each service. The 
automation of recovery is the most valuable aspect of 
the Ninja framework to service authors. 


7 Summary 


Ninja defines a new programming model and then 
uses the model to simplify the implementation of com- 
plex network services. The model exploits the natural 
parallelism of large-scale services and hides the com- 
plexity of threads, locks, shared state, recovery, load 
balancing and graceful degradation. It provides several
invariants that greatly simplify service authoring:


• Each service has its own virtual cluster, which may
vary in size transparently over time. We have shown
linear scalability up to 100 nodes for toy
applications and to 30 nodes for the e-mail server.


• Services can have many shared namespaces. Each
namespace provides strongly consistent shared state
across the nodes of the service.


• Shared state can be persistent and highly available
with automatic recovery from faults.


• Concurrency is implicit in the programming model,
which avoids the creation and management of
threads in applications. Atomicity is provided by the
shared state primitives and by isolation of
namespaces and local state.


• Connections and external names are managed
automatically for load balancing and fault tolerance.


• Overload is detected automatically, which initiates
graceful degradation as needed.


• The CM enables online evolution and graceful
degradation without help from service authors. They
may use partitions for fine-grain control of both
quality of service and graceful degradation.


• The CM and highly available shared state together
enable highly available services.


Because of these powerful invariants, Ninja services 
remain remarkably simple despite being scalable, highly 
available and persistent. We have been able to write sev- 
eral real services using Ninja, including instant messag- 
ing, a Napster-like file sharing system, and scalable web 
and e-mail servers. In all cases, the code for the service 
was small and relatively simple (e.g. no recovery or log- 
ging code). In the case of e-mail, we achieved a ten 
times reduction in code size for a comparable scalable 
server by using Ninja.


Abstract


A server application is commonly organized as a 
collection of concurrent threads, each of which executes 
the code necessary to process a request. This software 
architecture, which causes frequent control transfers 
between unrelated pieces of code, decreases instruction 
and data locality, and consequently reduces the effectiveness
of hardware mechanisms such as caches,
TLBs, and branch predictors. Numerous measurements
demonstrate this effect in server applications, which
often utilize only a fraction of a modern processor’s
computational throughput. 


This paper addresses this problem through cohort 
scheduling, a new policy that increases code and data 
locality by batching the execution of similar operations 
arising in different server requests. Effective implementation
of the policy relies on a new programming abstraction,
staged computation, which replaces threads.
The StagedServer library provides an efficient implementation
of cohort scheduling and staged computation.
Measurements of two server applications written with
this library show that cohort scheduling can improve
server throughput by as much as 20%, by reducing the
processor cycles per instruction by 30% and L2 cache 
misses by 50%. 


1 Introduction 


A server application is a program that manages ac- 
cess to a shared resource, such as a database, mail store, 
file system, or web site. A server receives a stream of
requests, processes each, and produces a stream of re- 
sults. Good server performance is important, as it de- 
termines the latency to access the resource and con- 
strains the server’s ability to handle multiple requests. 
Commercial servers, such as databases, have been the 
focus of considerable research to improve the underly- 
ing hardware, algorithms, and parallelism, as well as 
considerable development to improve their code.


Much of the hardware effort has concentrated on 
the memory hierarchy, where rapidly increasing proces- 
sor speed and parallelism and slowly declining memory 
access time created a growing gap that is a major performance
bottleneck in many systems. In recent processors,
loading a word from memory can cost hundreds of
cycles, during which three to four times as many in- 
structions could execute. High performance processors 
attempt to alleviate this performance mismatch through 
numerous mechanisms, such as caches, TLBs, and 
branch predictors [27]. These mechanisms exploit a 
well-known program property—spatial and temporal 
reuse of code and data—to keep at hand data that is 
likely to be reused quickly and to predict future pro- 
gram behavior. 


Server software often exhibits less program locality
and, consequently, achieves poorer performance than
other software. For example, many studies have found 
that commercial database systems running on-line 
transaction processing (OLTP) benchmarks incur high 
rates of cache misses and instruction stalls, which re- 
duce processor performance to as low as a tenth of its 
peak potential [4, 9, 20]. Part of this problem may be 
attributable to database systems’ code size [28], but 
their execution model is also responsible.


These systems are structured so that a process or 
thread runs for a short period before invoking a block- 
ing operation and relinquishing control, so processors 
execute a succession of diverse, non-looping code seg- 
ments that exhibit little locality. For example, Barroso 
et al. compared TPC-B, an OLTP benchmark whose 
threads execute an average of 25K instructions before 
blocking, against TPC-D, a compute-intensive decision- 
support system (DSS) benchmark whose threads execute
an average of 1.7M instructions before blocking
[9]. On an AlphaServer 4100, TPC-B had an L2 miss 
rate of 13.9%, an L3 miss rate of 2.7%, and overall per- 
formance of 7.0 cycles per instruction (CPI). By con- 
trast, TPC-D had an L2 miss rate of 1.2%, an L3 miss 
rate of 0.32%, and a CPI of 1.62. 


Instead of focusing on hardware, this paper takes 
an alternative—and complementary—approach of 
modifying a program’s behavior to improve its per- 
formance. The paper presents a new, user-level soft- 
ware architecture that enhances instruction and data 
locality and increases server software performance. The 
architecture consists of a scheduling policy and a programming
model. The policy, cohort scheduling, consecutively
executes a cohort of similar computations
that arise in distinct requests on a server. Computations 
in a cohort, because they are at roughly the same stage 
of processing, tend to reference similar code and data, 
and so consecutively executing them improves program 
locality and increases hardware performance. Staged 
computation, the programming model, provides a pro- 
gramming abstraction by which a programmer can 
identify and group related computations and make ex- 
plicit the dependences that constrain scheduling. Staged 
computation, moreover, has the additional benefits of 
reducing concurrency overhead and the need for expensive,
error-prone synchronization.


We implemented this scheduling policy and pro- 
gramming model in a reusable library (StagedServer). 
In two experiments, one with an I/O-intensive server 
and another with a compute-bound server, code using 
StagedServer performed significantly better than 
threaded versions. StagedServer lowered response time 
by as much as 20%, reduced cycles per instruction by 
30%, and reduced L2 cache misses by more than 50%. 

The paper is organized as follows. Section 2 introduces
cohort scheduling and explains how it can improve
program performance. Section 3 describes staged
computation. Section 4 briefly describes the Staged- 
Server library. Section 5 contains performance meas- 
urements. Section 6 discusses related work. 


[Figure 1 diagram: computations plotted against processor time, shown for Threaded Execution and for Cohort Scheduling, with cohorts marked.]

Figure 1. Cohort scheduling in operation. Shaded boxes indicate
different computations performed while processing requests on a
server. Cohort scheduling reorders the computations, so that similar
ones execute consecutively on a processor, which increases
program locality and processor performance.


2 Cohort Scheduling


Cohort scheduling is a technique for organizing the 
computation in server applications to improve program 
locality. The key insight is that distinct requests on a 
server execute similar computations. A server can defer 
processing a request until a cohort of computations ar- 
rive at a similar point in their processing and then exe- 
cute the cohort consecutively on a processor (Figure 1). 


This scheduling policy increases opportunities for 
code and data reuse, by reducing the interleaving of 
unrelated computations that causes cache conflicts and 
evicts live cache lines. The approach is similar to loop 
tiling or blocking [19], which restructures a matrix 
computation into submatrix computations that repeat- 
edly reference data before turning to the next subma- 
trix. Cohort scheduling, however, is a dynamic process 
that reorganizes a series of computations on items in an 
input stream, so that similar computations on different 
items execute consecutively. The technique applies to 
uniprocessors and multiprocessors, as both depend on 
program locality to achieve good performance. 
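
The reordering can be sketched with a toy model. Everything here is an
assumption made for illustration: the three stage names are hypothetical,
the "threaded" order is simplified to each request running all of its
stages back to back, and the number of switches between different kinds
of computation is used as a crude proxy for lost locality.

```python
from collections import deque

STAGES = ["parse", "lookup", "send"]   # hypothetical kinds of computation


def interleaved_order(num_requests):
    """Each request runs all of its stages before the next request starts."""
    return [(r, s) for r in range(num_requests) for s in STAGES]


def cohort_order(num_requests):
    """Defer work until a cohort forms, then run one stage for every request."""
    queues = {s: deque() for s in STAGES}
    for r in range(num_requests):
        queues[STAGES[0]].append(r)
    order = []
    for i, stage in enumerate(STAGES):
        while queues[stage]:
            r = queues[stage].popleft()
            order.append((r, stage))
            if i + 1 < len(STAGES):
                queues[STAGES[i + 1]].append(r)   # hand off to the next stage
    return order


def code_switches(order):
    """Count transitions between different kinds of computation."""
    return sum(1 for a, b in zip(order, order[1:]) if a[1] != b[1])


threaded_switches = code_switches(interleaved_order(8))
cohort_switches = code_switches(cohort_order(8))
```

Both orders perform the same 24 computations; the cohort order changes
the kind of code being run only twice, whereas the interleaved order
changes it at every step.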


Figure 2 illustrates the results of a simple experi- 
ment that demonstrates the benefit of cohort scheduling 
on a uniprocessor. It reports the cost, per call, of exe- 
cuting different sized cohorts of asynchronous writes to 
random blocks in a file. Each cohort ran consecutively 
on a system whose cache and branch table buffer had 
been flushed. As the cohort increased in size, the cost of 
each call decreased rapidly. A single call consumed 
109,000 cycles, but the average cost dropped 68% for a 
cohort of 8 calls and 82% for a cohort of 64 calls. A 
direct measure of locality, L2 cache misses, also im- 
proved dramatically. With a cohort of 8 calls, L2 misses 
per call dropped to 17% of the initial value and further 
declined to 4% with a cohort of 64 calls. These im- 
provements required no changes to the operating sys- 
tem code; only reordering operations in an application. 
Further improvement requires reductions in OS self- 
conflict misses (roughly 35 per system call), rather than 
amortizing the roughly 1500 cold start misses. 
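
The figures quoted above can be checked with a short calculation. The
additive model below (cold-start misses shared across the cohort, plus a
fixed number of self-conflict misses per call) is an assumption
introduced here for illustration; only the constants come from the text.

```python
COLD_START_MISSES = 1500      # roughly, paid once per cohort (from the text)
SELF_CONFLICT_MISSES = 35     # roughly, paid on every call (from the text)
SINGLE_CALL_CYCLES = 109_000  # cost of a single call (from the text)


def misses_per_call(cohort_size):
    """Assumed model: cold-start misses are amortized over the cohort."""
    return COLD_START_MISSES / cohort_size + SELF_CONFLICT_MISSES


def relative_misses(cohort_size):
    """Misses per call as a fraction of the single-call value."""
    return misses_per_call(cohort_size) / misses_per_call(1)


def cycles_after_drop(drop_fraction):
    """Average cycles per call after the reported percentage drop."""
    return SINGLE_CALL_CYCLES * (1 - drop_fraction)


cycles_8 = round(cycles_after_drop(0.68))    # cohort of 8 calls
cycles_64 = round(cycles_after_drop(0.82))   # cohort of 64 calls
model_8 = relative_misses(8)
model_64 = relative_misses(64)
```

Under this model the per-call misses fall to roughly 14% and 4% of the
single-call value for cohorts of 8 and 64, which is in the same range as
the measured 17% and 4%. It also shows why larger cohorts stop helping:
the per-call term of about 35 misses is not amortized.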


2.1 Assembling Cohorts 


Cohort scheduling is not irreparably tied to staged 
computation, but many benefits may be lost if a pro- 
grammer cannot explicitly form cohorts. For example, 
consider transparently integrating cohort scheduling 
with threads. The basic idea is simple. A modified 
thread scheduler identifies and groups threads with 
identical next program counter (nPC) values. Threads 
starting at the same point are likely to execute similar 
operations, even if their behavior eventually diverges. 
The scheduler runs a cohort of threads with identical 
nPCs before turning to the next cohort.

Figure 2. Performance of cohorts of WriteFileEx system calls in
Windows 2000 Advanced Server (Dell Precision 610 with an Intel
Pentium III processor). The chart reports the cost per call, in
processor cycles and L2 cache misses, of an asynchronous write to a
random 4K block in a file.


It is easy to believe that this scheme could some- 
times improve performance, and it requires only minor 
changes to a scheduler and no changes to applications. 
It, however, has clear shortcomings. In particular, nPC 
values are a coarse and indirect indicator of program 
behavior. Only threads with identical nPCs end up in a 
cohort, which misses many pieces of code with similar 
behavior. For example, several routines that access a 
data structure might belong in a cohort. Simple exten- 
sions to this scheme, such as using the distance between 
PCs as a measure of similarity, have little connection to 
logical behavior and are perturbed by compiler linking 
and code scheduling. Another disadvantage is that cohorts
start after blocking system calls, rather than at
application-appropriate points. In particular, compute- 
intensive applications or programs that use asynchro- 
nous I/O cannot use this scheme, as they do not block. 
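
The transparent scheme discussed above, and its first shortcoming, can be
sketched in a few lines. The thread identifiers and program counter
values are made up for illustration, and no real scheduler interface is
implied.

```python
from collections import defaultdict


def group_by_npc(threads):
    """Group ready threads, given as (thread_id, next_pc), by identical nPC."""
    cohorts = defaultdict(list)
    for thread_id, npc in threads:
        cohorts[npc].append(thread_id)
    return dict(cohorts)


def run_order(threads):
    """Run every thread in one cohort before turning to the next cohort."""
    cohorts = group_by_npc(threads)
    return [tid for npc in sorted(cohorts) for tid in cohorts[npc]]


# Hypothetical ready queue. Thread 4 resumes a few bytes after threads 2
# and 5, perhaps in the same routine, but its nPC is not identical.
ready = [(1, 0x4010), (2, 0x4A20), (3, 0x4010), (4, 0x4A24), (5, 0x4A20)]

cohorts = group_by_npc(ready)
order = run_order(ready)
```

Thread 4 ends up in a cohort of its own even though it probably touches
much of the same code as threads 2 and 5, which is the sense in which
nPC values are a coarse indicator of behavior.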


To correct these shortcomings and properly assem- 
ble a cohort, a programmer must delimit computations 
and identify the ones that belong in a cohort. Staged 
computation provides a programming abstraction that 
neatly captures both dimensions of cohorts. 


3 Staged Computation 


Staged computation is a programming abstraction
intended to replace threads as the construct underlying
concurrent or parallel programs. Stages offer compelling
performance and correctness advantages and are
particularly amenable to cohort scheduling. In this
model, a program is constructed from a collection of
stages, each of which consists of a group of exported
operations and private data. An operation is an
asynchronous procedure call, so its invocation, execution,
and reply are decoupled. Moreover, a stage has scheduling
autonomy, which enables it to control the order and
concurrency with which its operations execute.


A stage is conceptually similar to a class in an object-based
language, to the extent that it is a program
structuring abstraction providing local state and opera- 
tions. Stages, however, differ from objects in three ma- 
jor respects. First, operations in a stage are invoked 
asynchronously, so that a caller does not wait for a 
computation to complete, but instead continues and 
rendezvouses later, if necessary, to retrieve a result. 
Second, a stage has autonomy to control the execution 
of its operations. This autonomy extends to deciding 
when and how to execute the computations associated 
with invoked operations. Finally, stages are a control 
abstraction used to organize and process work, while 
objects are a data representation acted on by other enti- 
ties, such as functions, threads, or stages. 


A stage facilitates cohort scheduling because it
provides a natural abstraction for grouping operations
with similar behavior and locality and the control
autonomy to implement cohort scheduling. Operations
in a stage typically access local data, so that effective
cohort scheduling only requires a simple scheduler that
accumulates pending operations to form a cohort.


Stages provide additional programming advantages 
as well. Because they control their internal concur- 
rency, they promote a programming style that reduces 
the need for expensive, error-prone explicit synchronization.
Stages, moreover, provide the basis for specifying
and verifying properties of asynchronous programs.
This section briefly describes the staged programming 
model. Section 4 elaborates an implementation in a 
C++ class library. 


3.1 Stage Design 


Programmers group operations into a stage for a 
variety of reasons. The first is to regulate access to program
state (“static” data) by wrapping it in an abstract
data type. Operations grouped this way form an obvious 
cohort, as they typically have considerable instruction 
and data locality. Moreover, a programmer can control 
concurrency in a stage to reduce or eliminate synchro- 
nization for this data (Section 3.4). 


The second reason is to group logically related op- 
erations to provide a well-rounded and complete pro- 
gramming abstraction. This reason may seem less com- 
pelling than the first, but logically related operations 
frequently share code and data, so collecting them in a 
stage identifies operations that could benefit from co- 
hort scheduling. 


The third is to encapsulate program control logic in 
the form of a finite-state automaton. As discussed be- 
low, a stage’s asynchronous operations easily imple- 
ment the reactive transitions in an event-driven state 
machine.

Figure 3. Example of stages and operations. Stage-A runs op-a,
which invokes two operations in Stage-B and waits until they
complete before running op-a’s continuation.


In practice, designing a program with stages fo- 
cuses on partitioning the tasks into sub-tasks that are 
self-contained, have considerable code and data local- 
ity, and have logical unity. In many ways, this process 
is the control analogue of object-oriented design. 


3.2 Operations 


Operations are asynchronous computations ex- 
ported by a stage. Invocation of an operation only re- 
quires its eventual execution, so the invoker and opera- 
tion run independently. When an operation executes, it
can invoke any number of child operations on any 
stage, including its own. A parent can wait for its chil- 
dren to finish, retrieve results from their computation, 
and continue processing. Figure 3 shows an operation 
(op-a) running in Stage-A that invokes two operations 
(op-x and op-y) in Stage-B, performs further computa- 
tion, and then waits for its children. After they complete 
and return their results, op-a continues execution and 
processes the children’s results. 
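
The interaction in Figure 3 can be sketched as follows. This is a minimal
single-threaded model, not the StagedServer library: the Stage and Join
classes, their method names, and the arithmetic performed by op_x and
op_y are all hypothetical.

```python
from collections import deque


class Stage:
    """Holds invoked operations until the stage is given the processor."""

    def __init__(self, name):
        self.name = name
        self.pending = deque()

    def invoke(self, fn, *args):
        # Invocation only queues the operation; it runs later.
        self.pending.append((fn, args))

    def run_cohort(self):
        """Run every operation that is pending now, consecutively."""
        count = len(self.pending)
        for _ in range(count):
            fn, args = self.pending.popleft()
            fn(*args)
        return count


class Join:
    """Re-enables a parent's continuation once all children have replied."""

    def __init__(self, count, stage, continuation):
        self.results = {}
        self.remaining = count
        self.stage = stage
        self.continuation = continuation

    def reply(self, key, value):
        self.results[key] = value
        self.remaining -= 1
        if self.remaining == 0:
            self.stage.invoke(self.continuation, self.results)


stage_a, stage_b = Stage("Stage-A"), Stage("Stage-B")
log = []


def op_x(join, n):
    join.reply("x", n * 2)


def op_y(join, n):
    join.reply("y", n + 1)


def op_a_continuation(results):
    log.append(("op-a resumed", results["x"] + results["y"]))


def op_a(n):
    join = Join(2, stage_a, op_a_continuation)
    stage_b.invoke(op_x, join, n)
    stage_b.invoke(op_y, join, n)
    log.append(("op-a invoked children", n))   # parent keeps computing


stage_a.invoke(op_a, 10)
# Alternate between stages until neither has pending operations.
while stage_a.run_cohort() + stage_b.run_cohort():
    pass
```

op-a returns without blocking after invoking its children, and its
continuation runs in Stage-A only after both replies have arrived.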


The code within an operation executes sequentially 
and can invoke both conventional (synchronous) calls 
and asynchronous operations. However, once started, 
an operation is non-preemptible and runs until it relin- 
quishes the processor. Programmers, unfortunately, 
must be careful not to invoke blocking operations that 
suspend the thread running operations on a processor. 
An operation that relinquishes the processor to wait for 
an event—such as asynchronous I/O, synchronization, 
or operation completion—instead provides a continua- 
tion to be invoked when the event occurs [14]. 


A continuation consists of a function and enough
saved state to permit the computation to resume at the 
point at which it suspended. Explicit continuations are 
the simplest and least costly approach, as an operation 
saves only its live state in a structure called a closure. 
The other alternative, implicit continuations, requires 
the system to save the executing operation’s stack, so 
that it can be resumed. This scheme, similar to fibers, 
simplifies programming, at some performance cost [2].

Asynchronous operations provide low-cost parallelism,
which enables a programmer to express and
exploit the concurrency within an application. The
overhead, in time and space, of invoking an operation is
close to a procedure call, as it only entails allocating
and initializing a closure and passing it to a stage.
When an operation runs to completion, it does not require
its own stack or an area to preserve processor
state, which eliminates much of the cost of threads.
Similarly, returning a value and re-enabling a continua- 
tion are simple, inexpensive operations. 


3.3 Programming Styles 


Staged computation supports a variety of pro- 
gramming styles, including software pipelining, event- 
driven state machines, bi-directional pipelines, and 
fork-join parallelism. Conceptually, at least, stages in a 
server are arranged as a pipeline in which requests ar- 
rive at one end and responses flow from the other. This 
form of computation is easily supported by representing
a request as an object passed between stages. Linear 
pipelining of this sort is simple and efficient, because a 
stage retains no information on completed computa- 
tions. 


However, stages are not constrained to this linear 
style. Another common programming idiom is
bi-directional pipelining, which is the asynchronous
analogue of call and return. In this approach, a stage passes
subtasks to one or more other stages. The parent stage 
eventually suspends its work on the request, turns its 
attention to other requests, and resumes the original 
computation when the subtasks produce results. This 
style requires that an operation be broken into a series 
of subcomputations, which run when results appear. 
With explicit continuations, a programmer partitions 
the computation by hand, although a compiler could 
easily produce this code, which is close to the well- 
known continuation-passing style [6, 12]. With implicit 
continuations, a programmer only needs to indicate 
where the original computation suspends and waits for 
the subtasks to complete. 


A generalization of this style is event-driven pro- 
gramming, which uses a finite state automaton (FSA) to 
control a reactive system [26, 29]. The FSA logic is 
encapsulated in a stage and is driven by external events, 
such as network messages and I/O completions, and 
internal events from other asynchronous stages. An 
operation’s closure contains the FSA state for a particu- 
lar server request. The FSA changes state when a child 
operation completes or external events arrive. These 
transitions invoke computations associated with edges 
in the FSA. Each computation runs until it blocks and 
specifies the next state in the FSA. 
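
The event-driven style can be illustrated with a minimal sketch in Python. The states, event names, and the `RequestClosure` class below are hypothetical and chosen only for illustration; they are not the fifteen-state FSA of the web server or any StagedServer interface.

```python
from enum import Enum, auto


class State(Enum):
    PARSE = auto()
    LOOKUP = auto()
    SEND = auto()
    DONE = auto()


# Hypothetical transition table: (current state, event) -> next state.
TRANSITIONS = {
    (State.PARSE, "request_parsed"): State.LOOKUP,
    (State.LOOKUP, "file_found"): State.SEND,
    (State.SEND, "send_complete"): State.DONE,
}


class RequestClosure:
    """Holds the FSA state for one server request."""

    def __init__(self):
        self.state = State.PARSE
        self.trace = []

    def on_event(self, event):
        # Look up the edge for this event; unknown events leave the state unchanged.
        next_state = TRANSITIONS.get((self.state, event))
        if next_state is None:
            return False
        # Stand-in for the computation associated with the edge.
        self.trace.append((self.state.name, event))
        self.state = next_state
        return True
```

Each call to `on_event` corresponds to one transition: the computation attached to the edge runs, and the closure records the next state before control returns to the stage.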


USENIX Association 



Figure 4. Profile of staged web server (Section 5.1). The performance metrics for each stage are broken down by processor (the system is 
running on four processors). The first column is the average queue length. The second column contains three metrics on operations at the 
stage: the quantity, the average wait time (milliseconds), and the maximum wait time. The third column contains corresponding metrics for 
operations that are suspended and restarted. The fourth column contains corresponding metrics for completed operations. The numbers 
on arcs are the number of operations started or restarted between two stage-processor pairs. 


For example, the web server used in Section 5.1 is 
driven by a control-logic stage consisting of a FSA with 
fifteen states. The FSA describes the process by which 
a HTTP GET request arrives and is parsed, the refer- 
enced file is found in the cache or on disk, the file 
blocks are read and transmitted, and the file and con- 
nection are closed. 


Describing the control logic of a server as a FSA 
opens the possibility of verifying many properties of the 
entire system, such as deadlock freedom, by applying 
techniques, such as model checking [15, 22], developed 
to model and verify systems of communicating FSAs. 


3.4 Scheduling Policy Refinements 


The third attribute of a stage is scheduling auton- 
omy. When a stage is activated on a processor, the stage 
determines which operations execute and their order. 
This scheduling freedom allows several refinements of 
cohort scheduling to reduce the need for synchronization. 
In particular, we found three policies useful: 


• An exclusive stage executes at most one of its operations 
at a time. Since operations run sequentially 
and completely, access to stage-local data does not 
need synchronization. This type of stage is similar 
to a monitor, except that its interface is asynchronous: 
clients delegate computation to the stage, 
rather than block to obtain access to a resource. 
When this strict serialization does not cause a performance 
bottleneck, this policy offers fast, error-free 
access to data and a simple programming 
model. This approach works well for fine-grained 
operations, as the cost of acquiring and releasing the 
stage’s mutex can be amortized over a cohort of operations [25]. 


• A partitioned stage divides invocations (based on a 
key passed as a parameter), to avoid sharing data 
among operations running on different processors. 
For example, consider a file cache stage that parti- 
tions requests using a hash function on the file 
number. Each processor maintains its own hash ta- 
ble of in-memory disk blocks. Each hash table is ac- 
cessed by only one processor, which enhances local- 
ity and eliminates synchronization. This policy, 
which is reminiscent of shared-nothing databases, 
permits parallel data structures without fine-grain 
synchronization. 


• A shared stage runs its operations concurrently on 
many processors. Since several operations in a stage 
can execute concurrently, shared data accesses must 
be synchronized. 


Other policies are possible and could be easily imple- 
mented within a stage. 
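
As a rough illustration of the partitioned policy, the following Python sketch keeps one private table per processor and selects the table from a key. The class name, the method names, and the modulo partitioning function are assumptions made for this example; Python is used only for brevity and does not model real per-processor execution.

```python
class PartitionedFileCache:
    """One private table per processor; a key selects the partition."""

    def __init__(self, num_processors):
        self.tables = [{} for _ in range(num_processors)]

    def partition_for(self, file_id):
        # Assumed partitioning function: file number modulo processor count.
        return file_id % len(self.tables)

    def put(self, file_id, block):
        self.tables[self.partition_for(file_id)][file_id] = block

    def get(self, file_id):
        # Only the owning partition is consulted, so no other table is touched.
        return self.tables[self.partition_for(file_id)].get(file_id)
```

Because every request for a given file number maps to the same table, a table is only ever touched by the processor that owns it, which is what removes the need for locks in the partitioned policy.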


It is important to keep in mind that these policies are 
implemented within the more general framework of 
cohort scheduling. When a stage is activated on a processor, 
it executes its outstanding operations, one after 
another. Nothing in the staged model requires cohort 
scheduling. Rather, the programming model and scheduling 
policy naturally fit together. A stage groups logically 
related operations that share data and provides the 
freedom to reorder computations. Cohort scheduling 
exploits scheduling freedom by consecutively running 
similar operations. 


3.5 Understanding Performance 


A compelling advantage of the staged model is 
that the performance of the system is relatively easy to 
visualize and understand. Each stage is similar to a 
node in a queuing system. Parameters, such as average 
and maximum queue length, average and maximum 
wait time, and average and maximum processing time, 
are easily measured and displayed (Figure 4). These 
measurements provide a good overview of system per- 
formance and help identify bottlenecks. 


3.6 Stage Computation Example 


As an example of staged computation, consider the 
file cache used by the web server in Section 5.1. A file 
cache is an important component in many servers. It 
stores recently accessed disk blocks in memory and 
maps a file identifier and offset to a disk block. 


The staged file cache consists of three partitioned 
stages (Figure 5). The cache is logically partitioned 
across the processors, so each one manages a unique 
subset of the files, as determined by the hashed file 
identifier. Alternatively, for large files, the file identifier 
and offset can be hashed together, so a file’s disk 
blocks are striped across the table. Within the stage, 
each processor maintains a hash table that maps file 
identifiers to memory-resident disk blocks. Since a 
processor references only its table, accesses require no 
synchronization and data does not migrate between 
processor caches. 


If a disk block is not cached in memory, the cache 
invokes an operation on the I/O Aggregator stage, 
whose role is to merge requests for adjacent disk blocks 
to improve system efficiency. This stage utilizes cohort 
scheduling in a different way, by accumulating I/O requests 
in a cohort and combining them into a larger I/O 
request to the operating system. 
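
The merging step can be sketched as follows. This is a minimal Python illustration, assuming requests are identified by integer block numbers and that a merged request is described by a first block and a count; the function name `merge_adjacent` is hypothetical.

```python
def merge_adjacent(blocks):
    """Combine block numbers into (first_block, count) runs of adjacent blocks."""
    runs = []
    for b in sorted(set(blocks)):
        # Extend the last run if this block directly follows it.
        if runs and b == runs[-1][0] + runs[-1][1]:
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((b, 1))
    return runs
```

For a cohort requesting blocks 7, 3, 4, 5, 9, and 8, the sketch produces two runs of three blocks each, so six small requests become two larger ones.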


The Disk I/O stage reads and writes disk blocks. It 
issues asynchronous system calls to perform these operations 
and, for each, invokes an operation in the Event 
Server stage describing a pending I/O. This operation 
suspends until the I/O completes. The Event Server stage interfaces 
the operating system’s asynchronous notification 
mechanism to the staged programming model. It utilizes 
a separate thread, which waits on an I/O Completion 
Port that the system uses to signal completion of 
asynchronous I/O. At each notification, this stage 
matches an event with a waiting closure, which it re-enables 
and to which it passes the information from the Completion 
Port. The Disk I/O stage, in turn, returns disk 
blocks to the I/O Aggregator, which passes them to the 
FileCache stage, where the data are recorded and 
passed back to the client. 


Figure 5. Architecture of staged file cache. Requests for disk 
blocks are partitioned across processors to avoid sharing the hash 
table. If a block is not found, it is requested from an I/O aggregator, 
which combines requests for adjacent blocks and passes 
them to a disk I/O stage that asynchronously reads the files. When 
an I/O completes, an event server thread is notified, which passes 
the completion back to the disk I/O stage. 


4 StagedServer Library 


The StagedServer library is a collection of C++ 
classes that implement staged computation and cohort 
scheduling on either a uniprocessor or multiprocessor. 
This library enables a programmer to define stages, 
operations, and policies by writing only application- 
specific code. Moreover, StagedServer implements an 
aggressive and efficient version of cohort scheduling. 
This section briefly describes the library and its primary 
interfaces. 


The library’s functionality is partitioned between 
two principal classes. The first is the Stage class, which 
provides stage-local storage and mechanisms for col- 
lecting and scheduling operations. The second is the 
Closure class, which encapsulates an operation and its 
continuations, provides per-invocation state, and sup- 
ports invoking an operation and returning its result. The 
fundamental action in a StagedServer system is to in- 
voke an operation by creating and initializing a closure 
and handing it to a stage. 


4.1 Stage Class 


The Stage class is a templated base class that an 
application uses to derive classes for its various stages. 
The base class provides the basic functionality for managing 
closures and for scheduling and executing operations 
on processors. 


4.1.1 Scheduling Policy 


StagedServer implements a cohort scheduling policy, 
with enhancements to increase the processor affinity 
of data. The assignment of operations to processors 
occurs when an operation is submitted to a stage. By 
default, an operation invoked by code running on processor 
p executes on processor p in subsequent stages. 
This affinity policy enhances temporal locality and reduces 
cache traffic, as the operation’s data tend to remain 
in the processor’s cache. However, a program can 
override the default and execute an operation on a different 
processor when: the processor to execute the 
operation is explicitly specified, a stage partitions its 
operations among processors, or a stage uses load balancing 
to redistribute operations. 


A stage maintains a stack and queue for each processor 
in the system. In general, operations originating 
on the local processor are pushed on the stack and operations 
from other processors are enqueued on the 
queue. When a stage starts processing a cohort, it first 
empties its stack in LIFO order, before turning to the 
queue. This scheme has two rationales. Processing the 
most recently invoked operations first increases the 
likelihood that an operation’s data will reside in the 
cache. In addition, the stack does not require synchronization, 
since it is only accessed by one processor, which 
reduces the common-case cost of invoking an operation. 
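
The stack-then-queue ordering can be sketched in a few lines of Python. The `ProcessorWork` class and its method names are hypothetical. The sketch is single-threaded and only shows the ordering; in a real multiprocessor implementation the queue, unlike the stack, would need synchronization because other processors add to it.

```python
from collections import deque


class ProcessorWork:
    """Pending operations of one stage on one processor."""

    def __init__(self):
        self.stack = []        # operations invoked from this processor
        self.queue = deque()   # operations arriving from other processors

    def submit(self, op, from_local):
        if from_local:
            self.stack.append(op)
        else:
            self.queue.append(op)

    def drain(self):
        """Return operations in cohort order: stack (LIFO), then queue (FIFO)."""
        order = []
        while self.stack:
            order.append(self.stack.pop())
        while self.queue:
            order.append(self.queue.popleft())
        return order
```

Draining the stack first means the most recently submitted local operation, whose data is most likely still cached, runs before any operation that arrived from another processor.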


4.1.2 Processor Scheduling 


StagedServer currently uses a simple, wavefront 
algorithm to supply processors to stages. A programmer 
specifies an ordering of the stages in an application. In 
wavefront scheduling, processors independently alternate 
forward and backward traversals of this list of 
stages. At each stage, a processor executes operations 
pending in its stack and queue. When the operations are 
finished, the processor proceeds to the next stage. If the 
processor repeatedly finds no work, it sleeps for exponentially 
increasing intervals. If a processor 
cannot gain access to an exclusive stage, because 
another processor is already working in the stage, the 
processor skips the stage. 
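
The traversal order can be made concrete with a small Python sketch. It rests on two assumptions not stated in the text: a backward pass revisits every stage in reverse order (so the end stages are visited twice in a row), and the sleep interval doubles after each idle round. The function names are hypothetical.

```python
def wavefront_order(stages, passes):
    """List the stages one processor visits over the given number of passes."""
    visits = []
    for i in range(passes):
        # Even passes go forward through the list, odd passes go backward.
        visits.extend(stages if i % 2 == 0 else reversed(stages))
    return visits


def backoff_intervals(base, idle_rounds):
    """Sleep intervals that double after each consecutive round with no work."""
    return [base * 2 ** n for n in range(idle_rounds)]
```

With stages A, B, and C and three passes, the processor visits A, B, C, then C, B, A, then A, B, C again. A request handed forward on one pass can thus have its result picked up on the following backward pass.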


The alternating traversal order in wavefront scheduling 
corresponds to a common communications 
pattern, in which a stage passes requests to its successors, 
which perform a computation and produce a result. 
It is easy to imagine other scheduling policies, but 
we have not evaluated them, as this approach works 
well for the applications we have studied. This topic is 
worth further investigation. 


4.1.3 Thresholds 


An orthogonal attribute of a stage is a pair of 
thresholds that force StagedServer to activate a stage if 
more than a given number of operations are waiting or 
after a fixed interval. When either situation arises, 
StagedServer stops the currently running stage (after it 
completes its operation), runs the threshold-exceeding 
stage, and then returns to the suspended stage. For simplicity, 
an interrupting stage cannot be interrupted, so 
that other stages that exceed their thresholds are deferred 
until processing returns to the original stage. Thresholds 
are particularly useful for latency-sensitive stages, 
such as those interacting with the operating system, 
which must be regularly supplied with I/O requests to 
ensure that devices do not go idle. 


Another useful refinement is a feedback mecha- 
nism, by which a stage informs other stages that it has 
sufficient tasks. These other stages can suspend processing, 
effectively turning the processor over to the first 
stage. So far, voluntary cooperation, rather than hard 
queue limits, has sufficed. 


4.1.4 Partitioned Data 


A partitioned stage typically divides its data, so 
that the operations running on a processor access only a 
non-shared portion. Avoiding sharing eliminates the 
need to synchronize access to the data and reduces the 
cache traffic that results when data are accessed from 
more than one processor. The current system partitions 
a variable—using the well-known technique of privati- 
zation [30]—by storing its values in a vector with an 
entry for each processor. Code uses the processor id to 
index this vector and obtain a private value. 


4.2 Closure Class 


Closure is a templated base class for defining closures, 
which are a combination of code and data. 
StagedServer uses closures to implement operations and 
their continuations. When an operation is first invoked 
on a stage, the invoker creates a closure and initializes 
it with parameter values. Later, the stage executes the 
operation by invoking one of the closure’s methods, as 
specified by the operation invocation. This method is an 
ordinary C++ method. When it returns, the method 
must state whether the operation is complete (and optionally 
returns a value), if it is waiting for a child to 
finish, or if it is waiting for another operation to resume 
its execution. 


An operation can invoke operations on other 
stages—its children. The original operation waits for its 
children by providing a continuation routine that the 
system runs when the children finish. This continuation 
routine is simply another method in the original closure. 
The closure passes arguments between a parent and its 
continuation and results between a child and its parent. 
This process may repeat multiple times, with each con- 
tinuation taking on the role of a parent. In other words, 
these closures are actually multiple-entry closures, with 
an entry for the original operation invocation and en- 
tries for subsequent continuations. In practice, a stage 
treats these methods identically and does not distin- 
guish between an operation and its continuation. 
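
A multiple-entry closure can be sketched as an object with one method per entry. The Python below is only an illustration of the idea, not the C++ Closure class: the names `SumClosure`, `run`, `COMPLETE`, and `WAIT_CHILD` are hypothetical, and the child operation is simulated inline rather than dispatched to another stage.

```python
COMPLETE = "complete"
WAIT_CHILD = "wait_child"


class SumClosure:
    """Multiple-entry closure: start() is the operation, on_child() its continuation."""

    def __init__(self, values):
        self.values = values      # parameter values set by the invoker
        self.child_result = None  # result passed from child to parent
        self.result = None        # final result of the operation
        self.entry = self.start   # method the stage runs next

    def start(self):
        # Simulated child operation; a real child would run on another stage.
        self.child_result = sum(self.values)
        self.entry = self.on_child  # name the continuation
        return WAIT_CHILD

    def on_child(self):
        self.result = self.child_result * 2
        return COMPLETE


def run(closure):
    """Stand-in for a stage: call entry methods until the operation completes."""
    steps = 0
    while True:
        steps += 1
        if closure.entry() == COMPLETE:
            return steps
```

The stand-in stage calls whatever entry the closure currently names, so it handles the original invocation and the continuation in exactly the same way.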


5 Experimental Evaluation 


To evaluate the benefits of cohort scheduling and 
the StagedServer library, we built two prototypical 
server applications. The first—a web server—is I/O-bound, 
as its task consists of responding to HTTP GET 
requests by retrieving files from a disk or file cache and 
sending them over a network. The second—a publish-subscribe 
server—is compute-bound, as the amount of 
data transferred is relatively small, but the computation 
to match an event against a database of subscriptions is 
expensive and memory-intensive. 


5.1 I/O-Intensive Server 


To compare threads against stages, we implemented 
two web servers. The first is structured using a 
thread pool (THWS) and the second uses StagedServer 
(SSWS). We took care to make the two servers efficient 
and comparable and to share common code. In particular, 
both servers use Microsoft Windows’ asynchronous 
I/O operations. The threaded server was organized in a 
conventional manner as a thread accepting connections 
and passing them to a pool of 256 worker threads, each 
of which performs the server’s full functionality: parsing 
a request, reading a file, and transmitting its contents. 
This server used the kernel’s file cache. The 
SSWS server also can process up to 256 simultaneous 
requests. It was organized as a control logic stage, a 
network I/O stage, and the disk I/O and caching stages 
described in Section 3.6. The parameters were chosen 
by experimentation and yielded robust performance for 
the benchmark and hardware configuration. 


As a baseline for comparison, we also ran the experiments 
on Microsoft’s IIS web server, which is a 
highly tuned commercial product. IIS performed better 
than the other servers, but the difference was small, 
which partially validates their implementations. 


Our test configuration consisted of a server and 
three clients. The server was a Compaq ProLiant DL580R 
containing four 700MHz Pentium III-Xeon processors 
(2MB L2 cache) and 4GB of RAM. It had eight 
10000RPM SCSI3 disks, connected to a Compaq Smart 
Array controller. The clients ran on Dell PowerEdge 
6350s, each containing four 400MHz Pentium II Xeon 
processors with 1GB of RAM. The clients and server 
were connected by a dedicated Gigabit Ethernet network 
and both ran Windows 2000 Server (SP1). 


We used the SURGE benchmark, which retrieves 
web pages, whose size, distributions, and reference 
pattern are modeled on actual systems [8]. SURGE 
measures the ability of a web server to process HTTP 
GET requests, retrieve pages from a disk, and send 
them back to a client. This benchmark does not attempt 
to capture the full behavior of a web server, which must 
handle other types of HTTP requests, execute dynamic 
content, perform server management, and log data. To 
increase the load, we run a large configuration, with a 
web site of 1,000,000 pages (20.1 GB) and a reference 
stream containing 6,638,449 requests. A SURGE work- 
load is characterized by User-Equivalents (UEs), each 
of which models one user accessing the web site. We 
found that we could run up to 2000 UEs per client. All 
tests were run with the UE workload balanced across 
the client machines. The reported numbers are for 15 
minutes of client execution, starting with a freshly ini- 
tialized server. 


Figure 6 shows the bandwidth and latency of the 
thread (THWS) and StagedServer (SSWS) servers, and 
compares them against a commercial web server (IIS). 
The first chart contains the number of pages retrieved 
by the clients per second (since requests follow a fixed 
sequence, the number of pages is a measure of band- 
width) and the second chart contains the average la- 
tency, perceived by a client, to retrieve a page. 


Several trends are notable. Under light load, 
SSWS’s performance is approximately 6% lower than 
THWS, but as the load increases, SSWS responds to as 
many as 13% more requests per unit time. The second 
chart, in part, explains this difference. SSWS’s latency 
is higher than THWS’s latency under light load (by a 
factor of almost 20), but as the load increases, SSWS’s 
latency grows only 2.3 times, but THWS’s latency in- 
creases 45 times, to a level equal to SSWS’s. 


The commercial server, Microsoft's IIS, outper- 
formed SSWS by 4-9% and THWS by 0-22%. Its la- 
tency under heavy load was up to 45% better than the 
other servers’ latency. 


Figure 6. Performance of web servers. These charts show the 
performance of the threaded server (THWS), StagedServer server 
(SSWS), and Microsoft's IIS server (IIS). The first records the number 
of web pages received by the clients per second. The second 
records the average latency, as perceived by the client, to retrieve 
a page. The error bars are the standard deviation of the latency. 


SSWS performance, which is more stable and predictable 
under heavy load than the threaded server, is 
appropriate for servers, in which performance chal- 
lenges arise as offered load increases. SSWS server’s 
overall performance was relatively better and its proc- 
essor performance degraded less under load than the 
THWS server. The improved processor performance 
was reflected in a measurably improved throughput 
under load. 


5.2 Compute-Bound Server 


To evaluate the performance of StagedServer on a 
compute-bound application, we also built a simple pub- 
lish-subscribe server. The server used an efficient, 
cache-friendly algorithm to match events against an in-core 
database of subscriptions [16]. A subscription is a 
conjunction of terms comparing variables against integers. 
An event is a set of assignments of values to variables. 
An event matches a subscription if all of its terms 
are satisfied by the value assignments in the event. 
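
The matching rule itself can be stated in a few lines of Python. This is a naive, term-by-term check meant only to pin down the definition; it is not the cache-friendly algorithm of [16]. The term encoding `(variable, operator, integer)`, the operator set, and the choice to treat a variable missing from the event as a failed term are assumptions of this sketch.

```python
import operator

# Assumed comparison operators for subscription terms.
OPS = {"<": operator.lt, "<=": operator.le, "==": operator.eq,
       ">=": operator.ge, ">": operator.gt}


def matches(subscription, event):
    """True if every term (variable, op, integer) holds for the event's values."""
    for var, op, const in subscription:
        # A variable the event does not assign is treated as a failed term.
        if var not in event or not OPS[op](event[var], const):
            return False
    return True
```

For instance, a subscription with the terms `price < 100` and `qty >= 5` matches an event that assigns `price = 80` and `qty = 5`, but not one that assigns `price = 120`.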


Figure 7. Processor performance of servers. These charts show 
the processor performance of the threaded (THWS) and StagedServer 
(SSWS) web servers. The first chart shows the cycles per 
instruction (CPI) and the second shows the rate of L2 cache 
misses. 


Both the threaded (THPS) and StagedServer 
(SSPS) versions of this application shared a common 
publish-subscribe implementation; the only difference 
between them was the use of threads or stages to structure 
the computation. The benchmark was the Fabret 
workload: 1,000,000 subscriptions and 100,000 events. 
The platform was the same as above. 


The response time of the StagedServer version to 
events was better under load (Figure 8). With four or 
more clients publishing events, the THPS responded in 
an average of 0.57 ms to each request. With four cli- 
ents, SSPS responded in an average time of 0.53 ms, 
and its response improved to 0.47 ms with 16 or more 
clients (21% improvement over the threaded version). 


In large measure, this improvement is due to improved 
processor usage (Figure 8). With 16 clients, 
SSPS averaged 2.0 cycles per instruction (CPI) over the 
entire benchmark, while THPS averaged 2.7 CPI (26% 
reduction). Over the compute-intensive event matching 
portion, SSPS averaged 1.7 CPI, while THPS averaged 
2.5 CPI (33% reduction). In large measure, this improvement 
is attributable to a greater than 50% reduction 
in L2 cache misses, from 58% of user-space L2 
cache requests (THPS) to 26% of references (SSPS). 


Figure 8. Performance of Publish-Subscribe server. The top chart 
records the average response time to match publish events 
against subscriptions. The bottom chart compares average cycles 
per instruction (CPI) of the thread and StagedServer versions over 
25 second intervals. The initial part of each curve is the construction 
of internal data structures, while the flat part of the curves is 
the event matching. 


This application references a large data structure 
(approximately 66.7MB for the benchmark). When 
matching an event against subscriptions, Fabret’s algo- 
rithm, although cache-efficient, may access a large 
amount of data, and the particular locations are data 
dependent. StagedServer’s performance advantage re- 
sults from two factors. First, its code is organized so 
that only one processor references a given subset of the 
subscriptions, which reduces the number of distinct 
locations a processor references, and hence increases 
the possibility of data reuse. Without this locality opti- 
mization, SSPS runs at the same speed as THPS. Sec- 
ond, StagedServer batches cohorts of event matches in 
this data structure. We measured the benefit of cohort 
scheduling by limiting cohort size. Cohorts of four 
items reduced SSPS performance by 21%, ten items 
reduced performance by 17%, and twenty items re- 
duced performance by 9%. 


Both optimizations would be beneficial in threaded 
code, but the structure of the resulting server would be 
isomorphic to the StagedServer version, with a thread 
bound to each processor performing event lookups on a 
subset of the data structure, and a queue in front of each 
process to accumulate a cohort. 


6 Related Work 


The advantages and disadvantages of threads and 
processes are widely known [5]. More recently, several 
papers have investigated alternative server architec- 
tures. Chankhunthod et al. described the Harvest web 
cache, which uses an event-driven, reactive architecture 
to invoke computation at transitions in its state-machine 
control logic [13]. The system, like StagedServer, uses 
non-blocking I/O; careful avoidance of page faults; and 
a non-blocking, non-preemptive scheduling policy [7, 
26]. Pai proposed a four-fold characterization of server 
architectures: multi-process, multi-threaded, single- 
process event driven, and asymmetric multi-process 
event driven [26]. These alternatives are orthogonal to 
the task scheduling policy, and as the discussion in Sec- 
tion 2 illustrates, cohort scheduling could increase their 
locality. Pai’s favored event-driven programming style 
offers many opportunities for cohort scheduling, since 
event handlers partition a computation into distinct, 
easily identifiable subcomputations with clear operation 
boundaries. On the other hand, ad-hoc event systems 
offer no obvious way to group handlers that belong in 
the same cohort or to associate data with operations. 
Section 3 describes staged computation, a programming 
model that provides a programmer with control over the 
computation in a cohort. 


Welsh recently described the SEDA system, which 
is similar to the staged computation model [29]. SEDA, 
unlike StagedServer, does not use explicit cohort 
scheduling, but instead uses stages as an architecture 
for structuring event-driven servers. His performance 
results are similar for I/O-intensive server applications. 


Blackwell used blocked layer processing to im- 
prove the instruction locality of a TCP/IP stack [10]. He 
noted that the TCP/IP code was larger than the MIPS 
R2000 instruction cache, so that when the protocol 
stack processed a packet completely, no code from the 
lower protocol layers remained in cache for the next 
packet. His solution was to process several packets to- 
gether at each layer. The modified stack had a lower 
cache miss rate and reduced processing latency. Black- 
well related his approach to blocked matrix computa- 
tions [19], but his focus was instruction locality. Cohort 
scheduling, whose genesis predates Blackwell, is a 
more general scheduling policy and system architec- 
ture, which is applicable when a computation is not as 
cleanly partitionable as a network stack. Moreover, cohort 
scheduling improves data, not just instruction, locality 
and reduces synchronization as well. 


A stage is similar in some respects to an object in 
an object-based language, in that it provides both local 
state and operations to manipulate it. The two differ 
because objects are, for the most part, passive and their 
methods are synchronous—though asynchronous object 
models exist. Many object-oriented languages, such as 
Java [17], integrate threads and synchronization, but the 
active entity remains a thread synchronously running a 
method on a passive object. By contrast, in staged computation, 
a stage is asked to perform an operation, but is 
given the autonomy to decide how and when to actually 
execute the work. This decoupling of request and re- 
sponse is valuable because it enables a stage to control 
its concurrency and to adopt an efficient scheduling 
policy, such as cohort scheduling. 


Stages are similar in some respects to Agha’s Ac- 
tors [3]. Both start with a model of asynchronous com- 
munication between autonomous entities. Actors have 
no internal concurrency and do not give entities control 
over their scheduling, but instead presume a reactive 
model in which an Actor responds to a message by in- 
voking a computation. Stages, because of the internal 
concurrency and scheduling autonomy, are better suited 
to cohort scheduling. Actors are, in turn, an instance of 
dataflow, a more general computing model [23, 24]. 
Stages also can be viewed as an instance of dataflow 
computation. 


Cilk is a language based on a provably efficient 
scheduling policy [11]. The language is thread, not ob- 
ject, based, but it shares some characteristics with 
stages. In both, once started, a computation is not preempted. 
While running, a computation can spawn off 
other tasks, which return their results by invoking a 
continuation. However, Cilk’s work stealing scheduling 
policy does not implement cohort scheduling, nor is it 
under program control. Recent work, however, has im- 
proved the data locality of work stealing scheduling 
algorithms [1]. 


JAWS is an object-oriented framework for writing 
web servers [18]. It consists of a collection of design 
patterns, which can be used to construct servers adapted 
to a particular operating system by selecting an appro- 
priate concurrency mechanism (processes or threads), 
creating a thread pool, reducing synchronization, caching 
files, using scatter-gather I/O, or employing various 
HTTP- and TCP-specific optimizations. StagedServer is a 
simpler library that provides a programming model that 
directly enhances program locality and performance. 


An earlier version of this work was published as a 
short, extended abstract [21]. 


7 Conclusion 


Servers are commonly structured as a collection of 
parallel tasks, each of which executes all the code nec- 
essary to process a request. Threads, processes, or event 
handlers underlie the software architecture of most 
servers. Unfortunately, this software architecture can 
interact poorly with modern processors, whose performance 
depends on mechanisms (caches, TLBs, and 
branch predictors) that exploit program locality to 
bridge the increasing processor-memory performance 
gap. Servers have little inherent locality. A thread typi- 
cally runs for a short and unpredictable amount of time 
and is followed by an unrelated thread, with its own 
working set. Moreover, servers interact frequently with 
the operating system, which has a large and disruptive 
working set. The poor processor performance of servers 
is a natural consequence of their threaded architecture. 


As a remedy, we propose cohort scheduling, which 
increases server locality by consecutively executing 
related operations from different server requests. Run- 
ning similar code on a processor increases instruction 
and data locality, which aids hardware mechanisms, 
such as caches and branch predictors. Moreover, this 
architecture naturally issues operating system requests 
in batches, which reduces the system’s disruption. 


This paper also describes the staged computation 
programming model, which supports cohort scheduling 
by providing an abstraction for grouping related opera- 
tions and mechanisms through which a program can 
implement cohort scheduling. This approach has been 
implemented in the StagedServer library. In a series of 
tests using a web server and publish-subscribe server, 
the StagedServer code performed better than threaded 
code, with a lower level of cache misses and instruction 
stalls and better performance under heavy load. 


Abstract. This paper presents EtE monitor, a novel approach 
to measuring web site performance. Our system passively 
collects packet traces from a server site to determine 
service performance characteristics. We introduce a two-pass 
heuristic method and a statistical filtering mechanism to accu- 
rately reconstruct different client page accesses and to mea- 
sure performance characteristics integrated across all client 
accesses. Relative to existing approaches, EtE monitor offers 
the following benefits: i) a breakdown between the network 
and server overhead of retrieving a web page, ii) longitudinal 
information for all client accesses, not just the subset 
probed by a third party, iii) characteristics of accesses that 
are aborted by clients, and iv) quantification of the benefits 
of network and browser caches on server performance. Our 
initial implementation and performance analysis across two 
sample sites confirm the utility of our approach. 


1 Introduction 


Today, Internet services are delivering a large array of 
business, government, and personal services. Similarly, 
mission critical operations, related to scientific instru- 
mentation, military operations, and health services, are 
making increasing use of the Internet for delivering infor- 
mation and distributed coordination. However, the best 
effort nature of Internet data delivery, changing client 
and network connectivity characteristics, and the highly 
complex architectures of modern Internet services make 
it very difficult to understand the performance charac- 
teristics of Internet services. In a competitive landscape, 
such understanding is critical to continually evolving and 
engineering Internet services to match changing demand 
levels and client populations. 


*This work was originated and largely completed while Y. Fu
worked at HP Labs during the summer of 2001 and was supported in
part by a research grant from HP. A. Vahdat and Y. Fu are supported in
part by the National Science Foundation (EIA-9972879). A. Vah- 
dat is also supported by an NSF CAREER award (CCR-9984328). 


Currently, there are two popular techniques for bench- 
marking the performance of Internet services. The first 
approach, active probing [13, 17, 23, 19], uses machines 
from fixed points in the Internet to periodically request 
one or more URLs from a target web service, record end- 
to-end performance characteristics, and report a time- 
varying summary back to the web service. The second 
approach, web page instrumentation [8, 10, 2, 20], asso- 
ciates code (e.g., JavaScript) with target web pages. The 
code, after being downloaded into the client browser, 
tracks the download time for individual objects and re- 
ports performance characteristics back to the web site. 

In this paper, we present a novel approach to mea- 
suring web site performance called EtE monitor. Our 
system passively collects network packet traces from 
the server site to enable either offline or online analy- 
sis of system performance characteristics. Using two- 
pass heuristics and statistical filtering mechanisms, we 
are able to accurately reconstruct individual page com- 
position without parsing HTML files or obtaining out- 
of-band information about changing site characteristics. 
Relative to existing techniques, EtE monitor offers a 
number of benefits: 


• Our system can determine the breakdown between
the server and network overhead associated with 
retrieving a web page. This information is nec- 
essary to understand where performance optimiza- 
tions should be directed, for instance to improve 
server-side performance or to leverage existing con- 
tent distribution networks (CDNs) to improve net- 
work locality. 


• EtE monitor tracks all accesses to web pages for a
given service. Many existing techniques are typically
restricted to a few probes per hour to URLs that are
pre-determined to be popular. Our approach is much more
agile to changing client access patterns. What real
clients are accessing determines the performance that
EtE monitor evaluates. Finally, given the Zipf
popularity of service web pages [1], our approach is
able to track the characteristics of the heavy tail that
often makes up a large overall portion of web site
accesses.


• Given information on all client accesses, clustering
techniques [15] can be utilized to determine network 
performance characteristics by network region or 
autonomous system. System administrators can use 
this information to determine which content distri- 
bution networks to partner with (depending on their 
points of presence) or to determine multi-homing 
strategies with particular ISPs. 


• EtE monitor captures information on page requests
that are manually aborted by the client, either be- 
cause of unsatisfactory web site performance or spe- 
cific client browsing patterns (e.g., clicking on a link 
before a page has completed the download process). 
Existing techniques cannot model user interactions 
in the case of active probing or miss important as- 
pects of web site performance such as TCP connec- 
tion establishment in the case of web page instru- 
mentation. 


• Finally, EtE monitor is able to determine the
actual benefits of both browser and network caches.
By learning the likely composition of individual web 
pages, our system can determine when certain em- 
bedded objects of a web page are not requested 
and conclude that those objects were retrieved from 
some cache in the network. 


This paper presents the architecture and implementa- 
tion of our prototype EtE monitor. It also highlights 
the benefits of our approach through an evaluation of 
the performance of two sample network services using 
EtE monitor. Overall, we believe that detailed perfor- 
mance information will enable network services to dy- 
namically react to changing access patterns and system 
characteristics to best match client QoS expectations. 
Depending on the architecture of the system, a front end 
“Layer-7” switch [18] could redirect requests for particular
objects to a smaller or larger set of back-end machines
based on observed performance summaries. Similarly, 
performance characteristics across multiple services be- 
ing served from a single hosting center can be used to 
allocate resources to competing services to, for example, 
maximize aggregate throughput or to maintain higher- 
level service level agreements [4]. Sites may also use 
performance information to dynamically adjust system 
consistency [25] or content fidelity [3] with the goal of 
meeting target levels of performance. 

The rest of this paper is organized as follows. In the 
next section, we survey existing techniques and products 
and discuss their merits and drawbacks. Section 3 out- 
lines the EtE monitor architecture, with additional de- 
tails in Sections 4-6. In Section 7, we present the results 
of two performance studies, which have been performed
to test and validate EtE monitor and its approach.
Section 8 presents two specially designed experiments to
validate the accuracy of EtE monitor performance
measurements and its page access reconstruction power. We
discuss the limitations of the proposed technique in
Section 9 and present our conclusions and future work in
Section 10.

Acknowledgments: Neither the tool nor the study
would have been possible without the generous help of
our HP colleagues: Mike Rodriquez, Steve Yonkaitis, 
Guy Mathews, Annabelle Eseo, Peter Haddad, Bob 
Husted, Norm Follett, Don Reab, and Vincent Rabiller. 
Their help is highly appreciated. Our special thanks to 
Claude Villermain who helped to identify and to correct 
a subtle bug for dynamic page reconstruction. We would 
like to thank the anonymous referees for useful remarks 
and insightful questions, and our shepherd Jason Nieh 
for constructive suggestions to improve the content and 
presentation of the paper. 


2 Related Work 


A number of companies use active probing techniques to 
offer measurement and testing services today, including 
Keynote [13], Net Mechanic [17], Software Research [23], 
and Porivo Technologies [19]. Their solutions are based 
on periodic polling of web services using a set of
geographically distributed, synthetic clients. In general,
only a few pages or operations can typically be tested,
potentially reflecting only a fraction of all users’
experience. Further, active probing techniques cannot
typically capture the potential benefits of browser and
network caches, in some sense reflecting “worst case”
performance. From another perspective, active probes come
from a different set of machines than those that actually 
access the service. Thus, there may not always be cor- 
relation in the performance/reliability reported by the 
service and that experienced by end users. Finally, it 
is more difficult to determine the breakdown between 
network and server-side performance using active prob- 
ing, making it more difficult for customers to determine 
where best to place their optimization efforts. 

Another popular approach is to embed instrumen- 
tation code with web pages to record access times 
and report statistics back to the server. For instance, 
WTO (Web Transaction Observer) from HP OpenView 
suite [8] uses JavaScript to implement this functionality. 
With additional web server instrumentation and cookie 
techniques, this product can record the server processing 
time for a request, enabling a breakdown between server 
and network processing time. A number of other products
and proposals [10, 2, 20] employ similar techniques.
Relative to our approach, web page instrumentation can 
also capture end-to-end performance information from 
real clients, except connection establishment times
(potentially an important aspect of overall performance).
Further, this approach requires additional server-side in- 
strumentation and dedicated resources to actively collect 
performance reports from clients. 

There have been some earlier attempts to passively 
estimate the response time observed by clients from
network level information. SPAND [21, 22] determines
network characteristics by making shared, passive
measurements from a collection of hosts and uses this
information for server selection, i.e. for routing client requests
to the server with the best observed response time in a 
geographically distributed web server cluster. 


3 EtE Monitor Architecture 


EtE monitor consists of four program modules shown in 
Figure 1: 


[Figure: the Network Packet Collector, Request-Response
Reconstruction, and Web Page Reconstruction modules, with the
Network Trace and the Web Page Session Log as their outputs.]


Figure 1: EtE Monitor Architecture.

1. The Network Packet Collector module collects the
network packets using tcpdump [24] and records
them to a Network Trace, enabling offline analysis. 


2. In the Request-Response Reconstruction module, 
EtE monitor reconstructs all TCP connections from 
the Network Trace and extracts HTTP transactions 
(a request with the corresponding response) from 
the payload. EtE monitor does not consider en- 
crypted connections whose content cannot be an- 
alyzed. After obtaining the HTTP transactions, 
the monitor stores some HTTP header lines and 
other related information in the Transaction log 
for future processing (excluding the HTTP pay- 
load). To rebuild HTTP transactions from TCP- 
level traces, we use a methodology proposed by 
Feldmann [7] and described in more detail and
extended to work with persistent HTTP connections
by Krishnamurthy and Rexford [14]. 


3. The Web Page Reconstruction module is responsible 
for grouping underlying physical object retrievals 
together into logical web pages and stores them in 
the Web Page Session Log. 


4. Finally, the Performance Analysis and Statistics 
module summarizes a variety of performance char- 
acteristics integrated across all client accesses. 


EtE monitor can be deployed in several different ways. 
First, it can be installed on a web server as a software
component to monitor web transactions on a particular
server. However, our software would then compete with 
the web server for CPU cycles and I/O bandwidth (as 
quantified in Section 7). Another solution is to place EtE 
monitor as an independent network appliance at a point 
on the network where it can capture all HTTP transac- 
tions for a web server. If a web site consists of multiple 
web servers, EtE monitor should be placed at the com- 
mon entrance and exit of all web servers. If a web site 
is supported by geographically distributed web servers, 
such a common point may not exist. Nevertheless,
distributed web servers typically use “sticky connections”,
i.e., once the client has established a connection with 
a web server, the subsequent client requests are sent to 
the same server. In this case, EtE monitor can still be 
used to capture a flow of transactions to a particular 
geographic site. 


4 Request-Response Reconstruction Module


As described above, the Request-Response Reconstruction
module reconstructs all observed TCP connections.
The TCP connections are rebuilt from the Network Trace
using client IP addresses, client port numbers, and
request (response) TCP sequence numbers. Within the
payload of the rebuilt TCP connections, HTTP transac- 
tions can be delimited as defined by the HTTP protocol. 
Meanwhile, the timestamps, sequence numbers and ac- 
knowledged sequence numbers for HTTP requests can 
be recorded for later matching with the corresponding 
HTTP responses. 
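
The reassembly step described above can be sketched as follows. This is a minimal illustration rather than EtE monitor's code: the `Packet` record and the `rebuild_streams` function are hypothetical names, and the sketch assumes that a connection is identified by the client IP address and port alone, that sequence numbers do not wrap, and that a retransmitted segment starts at the same sequence number as the original.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """Hypothetical, simplified view of one captured TCP segment."""
    ts: float           # capture timestamp in seconds
    client_ip: str
    client_port: int
    from_client: bool   # True for the request direction, False for the response
    seq: int            # TCP sequence number of the first payload byte
    payload: bytes


def rebuild_streams(packets):
    """Group segments per connection and order each direction by sequence number."""
    segments = defaultdict(lambda: {True: {}, False: {}})
    for p in packets:
        if not p.payload:
            continue  # segments without payload carry no HTTP bytes
        direction = segments[(p.client_ip, p.client_port)][p.from_client]
        # Keep the first copy of a segment; a later copy is a retransmission.
        direction.setdefault(p.seq, p.payload)
    streams = {}
    for conn, dirs in segments.items():
        streams[conn] = {
            "request": b"".join(dirs[True][s] for s in sorted(dirs[True])),
            "response": b"".join(dirs[False][s] for s in sorted(dirs[False])),
        }
    return streams
```

HTTP transactions would then be delimited inside the `request` and `response` byte streams according to the HTTP protocol, which this sketch does not attempt.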

When a client clicks a hypertext link to retrieve a 
particular web page, the browser first establishes a TCP
connection with the web server by sending a SYN packet.
If the server is ready to process the request, it accepts
the connection by sending back a second SYN packet
acknowledging the client’s SYN.¹ At this point, the client
is ready to send HTTP requests to retrieve the HTML 
file and all embedded objects. For each request, we are 
concerned with the timestamps for the first byte and 
the last byte of the request since they delimit the re- 
quest transfer time and the beginning of server process- 
ing. We are similarly concerned with the timestamps of 
the beginning and the end of the corresponding HTTP 
response. 

EtE monitor detects aborted connections by observ- 
ing either a RST packet sent by an HTTP client to ex- 
plicitly indicate an aborted connection or by a FIN/ACK 


¹ Whenever EtE monitor detects a SYN packet, it considers the
packet as a new connection iff it cannot find a SYN packet with
the same source port number from the same IP address. A
retransmitted SYN packet is not considered as a newly established
connection. However, if a SYN packet is dropped, e.g. by
intermediate routers, there is no way to detect the dropped SYN
packet on the server side.
packet sent by the client where the acknowledged
sequence number is less than the observed maximum
sequence number sent from the server. After reconstructing
the HTTP transactions (a request and the corresponding
response), the monitor records the HTTP
header lines in the Transaction Log and discards the ac- 
tual body of the HTTP response. 
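
The abort test described above can be written down compactly. The following is only a sketch under stated assumptions: the function name `is_aborted` and the representation of a client packet as a `(flags, ack)` pair are hypothetical, and sequence-number wraparound is ignored.

```python
def is_aborted(client_packets, max_server_seq):
    """Return True if the client's packets indicate an aborted connection.

    client_packets: iterable of (flags, ack) pairs, where flags is a set of
    TCP flag names and ack is the acknowledged sequence number.
    max_server_seq: highest sequence number observed from the server.
    """
    for flags, ack in client_packets:
        if "RST" in flags:
            return True   # explicit reset sent by the client
        if {"FIN", "ACK"} <= flags and ack < max_server_seq:
            return True   # client closed before acknowledging all server data
    return False
```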

Each entry in the log includes a number of fields: (1) 
a unique flow ID for the TCP connection, (2) the client’s 
IP address, (3) the requested URL, (4) the content type, 
(5) the referer field, (6) the via field, (7) whether the 
request was aborted, (8) the number of packets resent
during the connection (potentially an indication of the 
presence of network congestion), (9) the size and times- 
tamps of the request and response. Some fields in the 
entry are used to rebuild web pages, while other fields 
can be used to measure end-to-end performance. 
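
For illustration, a Transaction Log entry with the nine fields listed above might be represented as below. The class name, field names, and types are assumptions made for this sketch; the paper does not specify a concrete layout.

```python
from dataclasses import dataclass


@dataclass
class TransactionLogEntry:
    """Hypothetical record mirroring fields (1)-(9) of a Transaction Log entry."""
    flow_id: int            # (1) unique flow ID of the TCP connection
    client_ip: str          # (2) client's IP address
    url: str                # (3) requested URL
    content_type: str       # (4) response content type
    referer: str            # (5) referer field, empty if not set
    via: str                # (6) via field, empty if not set
    aborted: bool           # (7) whether the request was aborted
    resent_packets: int     # (8) packets resent during the connection
    request_size: int       # (9) sizes in bytes and timestamps in seconds
    response_size: int
    req_start: float
    req_end: float
    resp_start: float
    resp_end: float
```

Fields (2) through (5) are the ones used later to rebuild web pages, while the sizes, timestamps, and resent-packet count feed the performance metrics.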

An alternative way to collect most of the fields of the 
Transaction Log entry is to extend web server function- 
ality. Apache, Netscape and IIS all have appropriate 
APIs. Most of the fields in the Transaction Log can 
be extracted via server instrumentation. This approach 
has some merits: 1) since a web server deals directly 
with request-response processing, the reconstruction of 
TCP connections becomes unnecessary; 2) it can handle 
encrypted connections. 

However, the primary drawback of this approach is 
that web servers must be modified in an application spe- 
cific manner. Our approach is independent of any partic- 
ular server technology. On the other hand, instrumenta- 
tion solutions cannot obtain network level information, 
such as the connection setup time and the resent packets, 
which can be observed by EtE monitor. 


5 Page Reconstruction Module 


To measure the client perceived end-to-end response 
time for retrieving a web page, one needs to identify 
the objects that are embedded in a particular web page 
and to measure the response time for the client requests 
retrieving these embedded objects from the web server. 
Although we can determine some embedded objects of a 
web page by parsing the HTML for the “container ob- 
ject”, some embedded objects cannot be easily discov- 
ered through static parsing. For example, JavaScript is 
used in web pages to retrieve additional objects. With- 
out executing the JavaScript, it may be difficult to dis- 
cover the identity of such objects. 

Automatically determining the content of a page
requires a technique to delimit individual page accesses.
One recent study [6] uses an estimate of client think time 
as the delimiter between two pages. While this method 
is simple and useful, it may be inaccurate in some im- 
portant cases. For example, consider the case where a 
client opens two web pages from one server at the same 
time. Here, the requests for the two different web pages
interleave each other without any think time between
them. Another case is when the interval between the 
requests for objects within one page may be too long to 
be distinguishable from think time (perhaps because of 
the network conditions). 

Different from previous work, our methodology uses 
heuristics to determine the objects composing a web 
page, and applies statistics to adjust the results. EtE 
uses the HTTP referer field as a major “clue” to group 
objects into a web page. The referer field specifies 
the URL from which the requested URL was obtained. 
Thus, all requests for the embedded objects in an HTML 
file are recommended to set the referer fields to the URL 
of the HTML file. However, since the referer fields are 
set by client browsers, not all browsers set the fields. To 
solve this, EtE monitor first builds a Knowledge Base 
from those requests with referer fields, and uses more 
aggressive heuristics to group the requests without
referer fields based on the Knowledge Base information.

Subsection 5.1 outlines Knowledge Base construction 
of web page objects. Subsection 5.2 presents the algo- 
rithm and technique to group the requests in web page 
accesses using Knowledge Base information and a set of 
additional heuristics. Subsection 5.3 introduces a statis- 
tical analysis to identify valid page access patterns and 
to filter out incorrectly constructed accesses. 


5.1 Building a Knowledge Base of Web 
Page Objects 


The goal of this step is to reconstruct a special subset of 
web page accesses, which we use to build a Knowledge 
Base about web pages and the objects composing them. 
Before grouping HTTP transactions into web pages, EtE
monitor first sorts all transactions from the Transaction
Log using the timestamps for the beginning of the
requests in increasing time order. Thus, the requests for
the embedded objects of a web page must follow the 
request for the corresponding HTML file of the page. 
When grouping objects into web pages (here and in the 
next subsection), we consider only transactions with suc- 
cessful responses, i.e. with status code 200 in the re- 
sponses. 

The next step is to scan the sorted transaction log 
and group objects into web page accesses. Not all the 
transactions are useful for the Knowledge Base construc- 
tion process. During this step, some of the Transaction 
Log entries are excluded from our current consideration: 


• Content types that are known not to contain embedded
objects are excluded from the knowledge base, e.g.,
application/postscript, application/x-tar,
application/pdf, application/zip and text/plain. For the
rest of the paper, we call them independent, single
page objects.


• If the referer field of a transaction is not set and its
content type is not text/html, EtE monitor excludes
it from further consideration.


To group the rest of the transactions into web page ac- 
cesses, we use the following fields from the entries in the 
Transaction Log: the request URL, the request referer 
field, the response content type, and the client IP ad- 
dress. EtE monitor stores the web page access informa- 
tion into a hash table, the Client Access Table depicted 
in Figure 2, which maps a client’s IP address to a Web 
Page Table containing the web pages accessed by the 
client. Each entry in the Web Page Table is a web page 
access, and composed of the URLs of HTML files and 
the embedded objects. Notice that EtE monitor makes 
no distinction between statically and dynamically gener- 
ated HTML files. We consider embedded HTML pages, 
e.g. framed web pages, as separate web pages. 


[Figure: the Client Access Table maps each client IP address to a
Web Page Table, whose entries list the objects of each page access.]


Figure 2: Client Access Table.


When processing an entry of the Transaction Log, EtE 
monitor first locates the Web Page Table for the client’s 
IP in the Client Access Table. Then, EtE monitor
handles the transaction according to its content type:

1. If the content type is text/html, EtE monitor treats
it as the beginning of a web page and creates a new web 
page entry in the Web Page Table. 

2. For other content types, EtE monitor attempts 
to insert the URL of the requested object into the web 
page that contains it according to its referer field. If 
the referred HTML file is already present in the Web 
Page Table, EtE monitor appends this object at the end 
of the entry. If the referred HTML file does not exist 
in the client’s Web Page Table, it means that the client 
may have retrieved a cached copy of the object from 
somewhere else between the client and the web server. 
In this case, EtE monitor first creates a new web page 
entry in the Web Page Table for the referred HTML file. 
Then it appends the considered object to this page. 

From the Client Access Table, EtE monitor deter- 
mines the content template of any given web page as 
a combined set of all the objects that appear in all the 
access patterns for this web page. Thus, EtE monitor 
scans the Client Access Table and creates a new hash
table, as shown in Figure 3, which is used as a Knowledge
Base to group the accesses for the same web pages from
other clients’ browsers that do not set the referer fields.


Figure 3: Knowledge Base of web pages.
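
The first pass of this subsection can be sketched as follows. The sketch is an interpretation, not the authors' implementation: the function name, the dictionary keys of a transaction, and the choice to attach an object to the most recent entry for its referred HTML file are all assumptions.

```python
from collections import defaultdict

# Content types the paper lists as never containing embedded objects.
INDEPENDENT_TYPES = {
    "application/postscript", "application/x-tar", "application/pdf",
    "application/zip", "text/plain",
}


def build_knowledge_base(transactions):
    """First pass: map each HTML URL to the set of objects seen embedded in it.

    transactions: iterable of dicts with the (hypothetical) keys
    ts, client_ip, url, referer, content_type, status.
    """
    client_access = defaultdict(list)  # client IP -> list of [html_url, obj, ...]
    for t in sorted(transactions, key=lambda t: t["ts"]):
        ctype = t["content_type"]
        if t["status"] != 200 or ctype in INDEPENDENT_TYPES:
            continue
        pages = client_access[t["client_ip"]]
        if ctype == "text/html":
            pages.append([t["url"]])          # start of a new page access
            continue
        if not t["referer"]:
            continue                          # considered only in the second pass
        for page in reversed(pages):          # assumption: use the latest match
            if page[0] == t["referer"]:
                page.append(t["url"])
                break
        else:
            # Referred HTML file not seen for this client (e.g. it was cached).
            pages.append([t["referer"], t["url"]])
    knowledge_base = defaultdict(set)
    for pages in client_access.values():
        for html_url, *objects in pages:
            knowledge_base[html_url].update(objects)
    return dict(knowledge_base)
```

The returned dictionary plays the role of the content templates: each key is an HTML URL and each value is the combined set of objects observed with it across all clients.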


5.2 Reconstruction of Web Page Accesses


With the help of the Knowledge Base, EtE monitor pro- 
cesses the entire Transaction Log again. This time, EtE 
monitor does not exclude the entries without referer 
fields. Using data structures similar to those introduced 
in Section 5.1, EtE monitor scans the sorted Transaction 
Log and creates a new Client Access Table to store all ac- 
cesses as depicted in Figure 2. For each transaction, EtE 
monitor locates the Web Page Table for the client’s IP 
in the Client Access Table. Then, EtE monitor handles 
the transaction depending on the content type: 

1. If the content type is text/html, EtE monitor
creates a new web page entry in the Web Page Table.

2. If a transaction is an independent, single page ob- 
ject, EtE monitor marks it as individual page without 
any embedded objects and allocates a new web page en- 
try in the Web Page Table. 

3. For other content types that can be embedded in 
a web page, EtE monitor attempts to insert it into the 
web page that contains it. 


• If the referer field is set for this transaction, EtE
monitor attempts to locate the referred page in the
following way. If the referred HTML file is in an
existing page entry in the Web Page Table, EtE monitor
appends the object at the end of the entry. If
the referred HTML file does not exist in the client’s
Web Page Table, EtE monitor first creates a new
web page entry in the table for the referred page
and marks it as nonexistent. Then it appends the
object to this page.


• If the referer field is not set for this transaction,
EtE monitor uses the following policies. With the help
of the Knowledge Base, EtE monitor checks each page
entry in the Web Page Table from the latest to the
earliest. If the Knowledge Base contains the content
template for the checked page and the considered object
does not belong to it, EtE monitor skips the entry and
checks the next one until a page containing the object
is found. If such an entry is found, EtE monitor
appends the object to the end of the web page.


• If none of the web page entries in the Web Page
Table contains the object based on the Knowledge 
Base, EtE monitor searches in the client’s Web Page 
Table for a web page accessed via the same flow ID 
as this object. If there is such a web page, EtE 
monitor appends the object to the page. 


• Otherwise, if there are any accessed web pages in
the table, EtE monitor appends the object to the 
latest accessed one. 


If none of the above policies can be applied, EtE monitor 
drops the request. Obviously, the above heuristics may 
introduce some mistakes. Thus, EtE monitor also adopts 
a configurable think time threshold to delimit web pages. 
If the time gap between the object and the tail of the 
web page that it tries to append to is larger than the 
threshold, EtE monitor skips the considered object. In 
this paper, we adopt a configurable think time threshold 
of 4 sec. 
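
The policies for embedded objects in this second pass, together with the think time check, can be sketched as one function. This is one possible reading of the text, offered as an illustration: the name `assign_object`, the page and object dictionaries, and their keys are hypothetical, and the Knowledge Base policy is simplified to "pick the latest page whose template contains the object".

```python
def assign_object(pages, obj, knowledge_base, think_time=4.0):
    """Attach one embedded object to a page entry of a client's Web Page Table.

    pages: list of page dicts (oldest first) with hypothetical keys
    url, objects, flow_ids, last_ts. obj: dict with url, referer, flow_id, ts.
    Returns the chosen page, or None if the object is dropped.
    """
    target = None
    if obj["referer"]:
        target = next((p for p in reversed(pages)
                       if p["url"] == obj["referer"]), None)
        if target is None:
            # Referred page was never observed: create it, marked nonexistent.
            target = {"url": obj["referer"], "objects": [], "flow_ids": set(),
                      "last_ts": obj["ts"], "nonexistent": True}
            pages.append(target)
    else:
        # Policy 1: latest page whose Knowledge Base template holds the object.
        target = next((p for p in reversed(pages)
                       if obj["url"] in knowledge_base.get(p["url"], ())), None)
        # Policy 2: a page accessed via the same flow ID.
        if target is None:
            target = next((p for p in reversed(pages)
                           if obj["flow_id"] in p["flow_ids"]), None)
        # Policy 3: the latest accessed page, if any.
        if target is None and pages:
            target = pages[-1]
    if target is None or obj["ts"] - target["last_ts"] > think_time:
        return None  # no policy applies, or the gap exceeds the think time
    target["objects"].append(obj["url"])
    target["flow_ids"].add(obj["flow_id"])
    target["last_ts"] = obj["ts"]
    return target
```

The default of 4 seconds matches the threshold used in the paper; everything else about the data layout is a choice made for the sketch.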


5.3 Identifying Valid Accesses Using
Statistical Analysis of Access Patterns


Although the above two-pass process can effectively pro- 
vide accurate web page access reconstruction in most 
cases, there could still be some accesses grouped incor- 
rectly. To filter out such accesses, we must better ap- 
proximate the actual content of a web page. 

All the accesses to a web page usually exhibit a set of 
different access patterns. For example, an access pattern 
can contain all the objects of a web page, while other pat- 
terns may contain a subset of them (e.g., because some 
objects were retrieved from a browser or network caches). 
We assume the same access patterns of those incorrectly 
grouped accesses should rarely appear repeatedly. Thus, 
we can use the following statistical analysis on access 
patterns to determine the actual content of web pages 
and exclude the incorrectly grouped accesses. 

First, from the Client Access Table created in Subsec- 
tion 5.2, EtE monitor collects all possible access patterns 
for a given web page and identifies the probable content 
template of the web page as the combined set of all ob- 
jects that appear in all the accesses for this page. Table 1 
shows an example of a probable content template. EtE 
monitor assigns an index for each object. The column 
URL lists the URLs of the objects that appear in the 
access patterns for the web page. The column Frequency 
shows the frequency of an object in the set of all web 
page accesses. In Table 1, the indices are sorted by the
occurrence frequencies of the objects. The column
Ratio is the percentage of the object’s accesses in the total
accesses for the page.


[Table body not recoverable from the source; columns: Index, URL,
Frequency, Ratio.]


Table 1: Web page probable content template. There are
3075 accesses for this page.


Sometimes, a web page may be pointed to by several
URLs. For example, http://www.hpl.hp.com and
http://www.hpl.hp.com/index.html both point to the
same page. Before computing the statistics of the access 
patterns, EtE monitor attempts to merge the accesses 
for the same web page with different URL expressions. 
EtE monitor uses the probable content templates of these 
URLs to determine whether they indicate the same web 
page. If the probable content templates of two pages 
only differ due to the objects with small percentage of 
accesses (less than 1%, which means these objects might 
have been grouped by mistake), then EtE monitor ig- 
nores this difference and merges the URLs. 

Based on the probable content template of a web page, 
EtE monitor uses the indices of objects in the table to 
describe the access patterns for the web page. Table 2 
demonstrates a set of different access patterns for the 
web page in Table 1. Each row in the table is an access
pattern. The column Object Indices shows the indices 
of the objects accessed in a pattern. The columns Fre- 
quency and Ratio are the number of accesses and the 
proportion of the pattern in the total number of all the 
accesses for the web page. For example, pattern 1 is a 
pattern in which only the object index.html is accessed. 
It is the most popular access pattern for this web page: 
2271 accesses out of the total 3075 accesses represent this 
pattern. In pattern 2, the objects index.html, img1.gif
and img2.gif are accessed.


Table 2: Web page access patterns.





With the statistics of access patterns, EtE monitor 
further attempts to estimate the true content template 
of web pages, which excludes the mistakenly grouped ac- 
cess patterns. Intuitively, the proportion of these invalid 
access patterns cannot be high. Thus, EtE monitor uses
a configurable ratio threshold to exclude the invalid pat- 
terns (in this paper, we use 1% as a configurable ratio 
threshold). If the ratio of a pattern is below the thresh- 
old, EtE does not consider it as a valid pattern. In the 
above example, patterns 8 and 9 are not considered as 
valid access patterns. Only the objects found in the valid
access patterns are considered as the embedded objects 
in a given web page. Objects 1, 2, and 3 define the 
true content template of the web page shown in Table 3. 
Based on the true content templates, EtE monitor filters 
out all the invalid accesses in a Client Access Table, and 
records the correctly constructed page accesses in the 
Web Page Session Log, which can be used to evaluate 
the end-to-end response performance.


Table 3: Web page true content template. 
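
The ratio-threshold filter can be illustrated with a short sketch. The function name and the representation of an access as a collection of object URLs are assumptions; the 1% default follows the threshold quoted in the text, and a pattern is treated as valid when its share of accesses is not below that threshold.

```python
from collections import Counter


def true_content_template(accesses, ratio_threshold=0.01):
    """Return (template, valid_patterns) for one web page.

    accesses: list of collections of object URLs, one per reconstructed access.
    A pattern is valid if its share of all accesses reaches the threshold.
    """
    patterns = Counter(frozenset(a) for a in accesses)
    total = sum(patterns.values())
    valid = {p for p, count in patterns.items()
             if count / total >= ratio_threshold}
    # The true content template is the union of objects in valid patterns.
    template = set().union(*valid)
    return template, valid
```

Accesses whose pattern is not in `valid_patterns` would then be filtered out of the Client Access Table before the Web Page Session Log is written.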








6 Metrics to Measure Web Service Performance


In this section, we introduce a set of metrics and the
ways to compute them in order to measure the efficiency
of a web service. These metrics can be categorized as:


• metrics approximating the end-to-end response time
observed by the client for a web page download.
Additionally, we provide a means to calculate the
breakdown between server processing and networking
portions of overall response time.


• metrics evaluating the caching efficiency for a given
web page by computing the server file hit ratio and
server byte hit ratio for the web page.


• metrics relating the end-to-end performance of
aborted web pages to the QoS.


6.1 Response Time Metrics


We use the following functions to denote the critical
timestamps for connection conn and request r:


• $t_{syn}(conn)$: time when the first SYN packet from
the client is received for establishing the connection
conn;


• $t^{start}_{req}(r)$: time when the first byte of the request r
is received;


• $t^{end}_{req}(r)$: time when the last byte of the request r is
received;


• $t^{start}_{resp}(r)$: time when the first byte of the response
for r is sent;


• $t^{end}_{resp}(r)$: time when the last byte of the response for
r is sent;


• $t^{ack}_{resp}(r)$: time when the ACK for the last byte of the
response for r is received.


Additionally, for a web page P, we have the following 
variables: 


• N - the number of distinct connections used to
retrieve the objects in the web page P;


• $r^k_1, \ldots, r^k_{n_k}$ - the requests for the objects retrieved
through the connection $conn_k$ ($k = 1, \ldots, N$), and
ordered according to the time when these requests
were received, i.e.,

$$t^{start}_{req}(r^k_1) \le t^{start}_{req}(r^k_2) \le \cdots \le t^{start}_{req}(r^k_{n_k}).$$

The extended version of HTTP 1.0 and the later
HTTP 1.1 [9] introduce the concepts of persistent
connections and pipelining. Persistent connections enable
reuse of a single TCP connection for multiple object
retrievals from the same IP address. Pipelining allows a
client to make a series of requests on a persistent
connection without waiting for the previous response to
complete (the server must, however, return the responses in
the same order as the requests are sent).

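We consider the requests $r_i^k, \dots, r_n^k$ to belong to
the same pipelining group (denoted as
$\mathit{PipeGr} = \{r_i^k, \dots, r_n^k\}$) if for any $j$ such that
$i \le j-1 < j \le n$,

$$t_{req}^{start}(r_j^k) \le t_{resp}^{end}(r_{j-1}^k).$$

Thus for all the requests on the same connection
$\mathit{conn}_k$: $r_1^k, \dots, r_{n_k}^k$, we define the maximum
pipelining groups in such a way that they do not intersect, e.g.,

$$\underbrace{r_1^k, \dots, r_i^k}_{\mathit{PipeGr}_1},\ \underbrace{r_{i+1}^k, \dots}_{\mathit{PipeGr}_2},\ \dots,\ \underbrace{\dots, r_{n_k}^k}_{\mathit{PipeGr}_l}$$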
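The grouping rule can be illustrated with a minimal Python sketch. It assumes the condition reads "request j starts arriving no later than the end of response j-1", and that the requests of one connection are already ordered by arrival time. The `Request` record and the function name `pipelining_groups` are hypothetical and not part of EtE monitor.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    req_start: float   # first byte of the request received
    req_end: float     # last byte of the request received
    resp_start: float  # first byte of the response sent
    resp_end: float    # last byte of the response sent


def pipelining_groups(requests: List[Request]) -> List[List[Request]]:
    """Split one connection's requests into maximal pipelining groups.

    `requests` must already be ordered by arrival time. A request joins
    the current group when it starts arriving no later than the moment
    the previous response finishes (the assumed grouping condition).
    """
    groups: List[List[Request]] = []
    for req in requests:
        if groups and req.req_start <= groups[-1][-1].resp_end:
            groups[-1].append(req)   # overlaps the previous response
        else:
            groups.append([req])     # starts a new, disjoint group
    return groups
```

Because each request is compared only with its immediate predecessor, the resulting groups are disjoint and cover the whole connection.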


For each of the pipelining groups, we define three
portions of response time: total response time (Total),
network-related portion (Network), and lower-bound
estimate of the server processing time (Server).

Let us consider the following example. For convenience,
let us denote $\mathit{PipeGr}_1 = \{r_1^k, \dots, r_i^k\}$.
Then

$$\mathrm{Total}(\mathit{PipeGr}_1) = t_{resp}^{end}(r_i^k) - t_{req}^{start}(r_1^k),$$

$$\mathrm{Network}(\mathit{PipeGr}_1) = \sum_{j=1}^{i} \left( t_{resp}^{end}(r_j^k) - t_{resp}^{start}(r_j^k) \right),$$

$$\mathrm{Server}(\mathit{PipeGr}_1) = \mathrm{Total}(\mathit{PipeGr}_1) - \mathrm{Network}(\mathit{PipeGr}_1).$$

If no pipelining exists, a pipelining group only consists
of one request. In this case, the computed server time
represents precisely the server processing time for a given
request-response pair. If a connection adopts pipelining,
the “real” server processing time might be larger than 
the computed server time because it can partially overlap 
the network transfer time, and it is difficult to estimate 
the exact server processing time from the packet-level 
information. However, we are still interested in estimating
the “non-overlapping” server processing time, as this is
the portion of the server time on the critical path of the overall
end-to-end response time. Thus, we use as an estimate
the lower-bound server processing time, which is explicitly
exposed in the overall end-to-end response.
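As an illustration, the three portions for one pipelining group can be computed as in the sketch below. It assumes Total spans from the first byte of the first request to the last byte of the last response, and Network is the sum of the response transfer times. The `Request` tuple and `group_breakdown` are hypothetical names chosen for this example.

```python
from typing import List, NamedTuple, Tuple


class Request(NamedTuple):
    req_start: float   # first byte of the request received
    resp_start: float  # first byte of the response sent
    resp_end: float    # last byte of the response sent


def group_breakdown(group: List[Request]) -> Tuple[float, float, float]:
    """Return (Total, Network, Server) for one non-empty pipelining group."""
    total = group[-1].resp_end - group[0].req_start
    # Network portion: time spent sending each response.
    network = sum(r.resp_end - r.resp_start for r in group)
    # Server portion: what remains, a lower bound under pipelining.
    server = total - network
    return total, network, server
```

For a single request, Server is exactly the gap between receiving the request and starting the response. For a pipelined group it only counts server time that is not hidden behind an earlier response transfer.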

If connection $\mathit{conn}_k$ is a newly established connection
to retrieve a web page, we observe additional connection
setup time (footnote 2):

$$\mathrm{Setup}(\mathit{conn}_k) = t_{req}^{start}(r_1^k) - t_{syn}(\mathit{conn}_k),$$

otherwise the setup time is 0. Additionally, we define
$t^{start}(\mathit{conn}_k) = t_{syn}(\mathit{conn}_k)$ for a newly established
connection, otherwise, $t^{start}(\mathit{conn}_k) = t_{req}^{start}(r_1^k)$.


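Similarly, we define the breakdown for a given
connection $\mathit{conn}_k$, where the sums run over the $l$
pipelining groups of $\mathit{conn}_k$:

$$\mathrm{Total}(\mathit{conn}_k) = \mathrm{Setup}(\mathit{conn}_k) + t_{resp}^{end}(r_{n_k}^k) - t_{req}^{start}(r_1^k),$$

$$\mathrm{Network}(\mathit{conn}_k) = \mathrm{Setup}(\mathit{conn}_k) + \sum_{j=1}^{l} \mathrm{Network}(\mathit{PipeGr}_j),$$

$$\mathrm{Server}(\mathit{conn}_k) = \sum_{j=1}^{l} \mathrm{Server}(\mathit{PipeGr}_j).$$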


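Now, we define similar latencies for a given page P:

$$\mathrm{Total}(P) = \max_{j \le N} t_{resp}^{end}(r_{n_j}^j) - \min_{j \le N} t^{start}(\mathit{conn}_j),$$

$$\mathrm{CumNetwork}(P) = \sum_{j=1}^{N} \mathrm{Network}(\mathit{conn}_j),$$

$$\mathrm{CumServer}(P) = \sum_{j=1}^{N} \mathrm{Server}(\mathit{conn}_j).$$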


For the rest of the paper, we will use the term EtE time
interchangeably with Total(P) time.
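The connection-level and page-level definitions can be sketched together as follows. This is a minimal illustration under stated assumptions: each connection is summarized by its SYN time (or `None` when the connection is reused), the first request's start time, the last response's end time, and the (Network, Server) pair of each pipelining group. The `Connection` class and `page_latencies` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Connection:
    t_syn: Optional[float]             # SYN arrival; None for a reused connection
    first_req_start: float             # first byte of the first request received
    last_resp_end: float               # last byte of the last response sent
    groups: List[Tuple[float, float]]  # (Network, Server) per pipelining group

    def setup(self) -> float:
        # Setup time is only observed on newly established connections.
        if self.t_syn is None:
            return 0.0
        return self.first_req_start - self.t_syn

    def start(self) -> float:
        return self.first_req_start if self.t_syn is None else self.t_syn

    def total(self) -> float:
        return self.setup() + self.last_resp_end - self.first_req_start

    def network(self) -> float:
        return self.setup() + sum(net for net, _ in self.groups)

    def server(self) -> float:
        return sum(srv for _, srv in self.groups)


def page_latencies(conns: List[Connection]) -> Tuple[float, float, float]:
    """Return (Total(P), CumNetwork(P), CumServer(P)) for one page access."""
    total = max(c.last_resp_end for c in conns) - min(c.start() for c in conns)
    cum_network = sum(c.network() for c in conns)
    cum_server = sum(c.server() for c in conns)
    return total, cum_network, cum_server
```

Note that the cumulative values add up time across connections, so they can exceed Total(P) whenever connections overlap in time.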

All the above formulae use $t_{resp}^{end}(r)$ to calculate
response time. An alternative way is to use as the end of
a transaction the time $t_{resp}^{ack}(r)$ when the ACK for the
last byte of the response is received by a server.
Figure 4 shows an example of a simplified scenario where a
1-object page is downloaded by the client: it shows the
communication protocol for connection setup between
the client and the server as well as the set of major
timestamps collected by the EtE monitor on the server side.
The connection setup time measured on the server side
is the time between the client SYN packet and the first
byte of the client request. This represents a close
approximation for the original client setup time (we present
more detail on this point in Section 8 when reporting our
validation experiments).


Figure 4: An example of a 1-object page download by the
client: major timestamps collected by the EtE monitor on
the server side.

If the ACK for the last byte of the client response is
not delayed or lost, $t_{resp}^{ack}(r)$ is a more accurate
approximation of the end-to-end response time observed by the
client: it “compensates” for the latency of the first client
SYN packet that is not measured on the server side.
The difference between the two methods, i.e., EtE time
(last byte) and EtE time (ack), is only a round trip time,
which is on the scale of milliseconds. Since the overall
response time is on the scale of seconds, we consider this
deviation an acceptably close approximation. However,
to avoid the problems with delayed or lost ACKs, EtE
monitor determines the end of a transaction as the time
when the last byte of a response is sent by a server.

Metrics introduced in this section account for packet 
retransmission. However, if the retransmission happens 
on connection establishment (i.e. due to dropped SYNs), 
EtE monitor cannot account for this. 

The functions CumNetwork(P) and CumServer(P) 
give the sum of all the network-related and server pro- 
cessing portions of the response time over all connections 
used to retrieve the web page. However, the connections 
can be opened concurrently by the browser. To evaluate 
the concurrency impact, we introduce the page concur- 
rency coefficient ConcurrencyCoef(P): 


$$\mathrm{ConcurrencyCoef}(P) = \frac{\sum_{k=1}^{N} \mathrm{Total}(\mathit{conn}_k)}{\mathrm{Total}(P)}.$$


Using the page concurrency coefficient, we finally compute
the network-related and the server-related portions of
response time for a particular page P:


Network(P) = CumNetwork(P)/ConcurrencyCoef(P),
Server(P) = CumServer(P)/ConcurrencyCoef(P).
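A short sketch of this last step follows. It assumes the coefficient is the sum of the per-connection totals divided by Total(P), and the function name `page_breakdown` is hypothetical.

```python
from typing import List, Tuple


def page_breakdown(conn_totals: List[float], total_page: float,
                   cum_network: float, cum_server: float
                   ) -> Tuple[float, float, float]:
    """Return (ConcurrencyCoef(P), Network(P), Server(P))."""
    # A coefficient of 1 means the connections were used back to back;
    # larger values mean they overlapped in time.
    coef = sum(conn_totals) / total_page
    return coef, cum_network / coef, cum_server / coef
```

When each connection's total equals its network plus server portion, the scaled values Network(P) and Server(P) sum to Total(P), which is what makes them usable as a breakdown of the page's end-to-end time.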


EtE monitor can distinguish the requests sent to a
web server from clients behind proxies by checking the
HTTP via fields. If a client page access is handled via the
same proxy (which is typically the case, especially when
persistent connections are used), EtE monitor provides
correct measurements for end-to-end response time and
other metrics, and also provides interesting statistics
on the percentage of client requests coming from proxies.
Clearly, this percentage is an approximation, since not
all the proxies set the via fields in their requests. Finally,
EtE monitor can only measure the response time to a
proxy instead of the actual client behind it.


2 The connection setup time as measured by EtE monitor does
not include dropped SYNs, as discussed earlier in Section 4.
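The via-field check can be sketched as below. This is only an illustration: it assumes each request's headers are available as a name-to-value mapping, and the function names `came_through_proxy` and `proxy_share` are hypothetical.

```python
from typing import Dict, Iterable


def came_through_proxy(headers: Dict[str, str]) -> bool:
    """True if the request carries a Via header (names are case-insensitive)."""
    return any(name.lower() == "via" for name in headers)


def proxy_share(requests: Iterable[Dict[str, str]]) -> float:
    """Fraction of requests that carry a Via header (0.0 for an empty log)."""
    reqs = list(requests)
    if not reqs:
        return 0.0
    return sum(came_through_proxy(h) for h in reqs) / len(reqs)
```

As the text notes, the result is a lower bound on proxy traffic, since proxies that do not set the field are counted as direct clients.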


6.2 Metrics Evaluating the Web Service Caching Efficiency


Real clients of a web service may benefit from the
presence of network and browser caches, which can
significantly reduce their perceived response time. However,
none of the existing performance measurement
techniques provides any information on the impact of caches
on web services: what percentage of the files and bytes
are delivered from the server compared with the total
files and bytes required for delivering the web service.
This impact can only be partially evaluated from web 
server logs by checking response status code 304, whose 
corresponding requests are sent by the network caches 
to validate whether the cached object has been modi- 
fied. If the status code 304 is set, the cached object is 
not expired and need not be retrieved again. 

To evaluate the caching efficiency of a web service, 
we introduce two metrics: server file hit ratio and server 
byte hit ratio for each web page. 

For a web page P, assume the objects composing the
page are $O_1, \dots, O_n$. Let $\mathrm{Size}(O_i)$ denote the size of
object $O_i$ in bytes. Then we define $\mathrm{NumFiles}(P) = n$ and
$\mathrm{Size}(P) = \sum_{j=1}^{n} \mathrm{Size}(O_j)$.

Additionally, for each access $P_{access}^i$ of the page P,
assume the objects retrieved in the access are
$O_1^i, \dots, O_{k_i}^i$; we define $\mathrm{NumFiles}(P_{access}^i) = k_i$ and
$\mathrm{Size}(P_{access}^i) = \sum_{j=1}^{k_i} \mathrm{Size}(O_j^i)$.
First, we define file hit ratio and byte hit
ratio for each page access in the following way:

$$\mathrm{FileHitRatio}(P_{access}^i) = \mathrm{NumFiles}(P_{access}^i) / \mathrm{NumFiles}(P),$$

$$\mathrm{ByteHitRatio}(P_{access}^i) = \mathrm{Size}(P_{access}^i) / \mathrm{Size}(P).$$

Let $P_{access}^1, \dots, P_{access}^N$ be all the accesses to the page P
during the observed time interval. Then

$$\mathrm{ServerFileHitRatio}(P) = \frac{1}{N} \sum_{k \le N} \mathrm{FileHitRatio}(P_{access}^k),$$

$$\mathrm{ServerByteHitRatio}(P) = \frac{1}{N} \sum_{k \le N} \mathrm{ByteHitRatio}(P_{access}^k).$$

Lower values of the server file hit ratio and server
byte hit ratio indicate higher caching efficiency for
the web service, i.e., more files and bytes are served from
network and client browser caches.
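The two ratios can be computed as in the following sketch. It assumes the page is described by a mapping from object name to size in bytes, and each access by the list of objects actually fetched from the server. The function name `server_hit_ratios` and the data layout are hypothetical.

```python
from typing import Dict, List, Tuple


def server_hit_ratios(page: Dict[str, int],
                      accesses: List[List[str]]) -> Tuple[float, float]:
    """Return (ServerFileHitRatio, ServerByteHitRatio) for one page.

    `page` maps every object of the page to its size in bytes;
    `accesses` must be non-empty and lists, per page access, the
    objects that were retrieved from the server.
    """
    num_files = len(page)
    size = sum(page.values())
    # Per-access ratios, then a plain average over all accesses.
    file_ratios = [len(objs) / num_files for objs in accesses]
    byte_ratios = [sum(page[o] for o in objs) / size for objs in accesses]
    n = len(accesses)
    return sum(file_ratios) / n, sum(byte_ratios) / n
```

For instance, a two-object page where one access fetches both objects and another fetches only the smaller one yields a file hit ratio halfway between 1 and 0.5, while the byte hit ratio depends on the relative object sizes.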
6.3 Aborted Pages and QoS


User-perceived QoS is another important metric to
consider in EtE monitor. One way to measure the QoS of
a web service is to measure the frequency of aborted
connections. However, such a simplistic interpretation of
aborted connections and web server QoS has several
drawbacks. First, a client can interrupt HTTP
transactions by clicking the browser’s “stop” or “reload” button
while a web page is downloading, or clicking a displayed
link before the page is completely downloaded. Thus,
only a subset of aborted connections are relevant to poor
web site QoS or poor networking conditions, while other
aborted connections are caused by client-specific
browsing patterns. Second, a web page can be
retrieved through multiple connections. A client’s
browser-level interruption can cause all the currently open
connections to be aborted. Thus, the number of aborted
page accesses more accurately reflects client satisfaction
than the number of aborted connections.

For aborted pages, we distinguish the subset of pages
$\Pi_{bad}$ with the response time higher than the given
threshold $X_{EtE}$ (in our case, $X_{EtE}$ = 6 sec). Only these
pages might be reflective of the bad quality downloads.
While a simple deterministic cutoff point cannot truly
capture a particular client’s expectation for site
performance, the current industrial ad hoc quality goal is to
deliver pages within 6 sec [12]. We thus attribute aborted
pages that have not crossed the 6 sec threshold to
individual client browsing patterns. The next step is to
distinguish the reasons leading to poor response time:
whether it is due to network or server-related
performance problems, or both.
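The threshold rule can be sketched as follows. This is a minimal illustration that assumes each reconstructed page access carries its end-to-end time and an aborted flag. The names `PageAccess` and `slow_aborted` are hypothetical, and the 6 sec default comes from the text.

```python
from typing import List, NamedTuple

THRESHOLD_SEC = 6.0  # quality goal quoted in the text


class PageAccess(NamedTuple):
    ete_time: float  # end-to-end response time in seconds
    aborted: bool    # whether the page access was aborted


def slow_aborted(accesses: List[PageAccess],
                 threshold: float = THRESHOLD_SEC) -> List[PageAccess]:
    """Aborted accesses whose response time exceeds the threshold.

    Aborted accesses at or below the threshold are treated as
    client browsing behavior rather than poor service quality.
    """
    return [a for a in accesses if a.aborted and a.ete_time > threshold]
```

Only the accesses returned by this filter are then examined for network-related versus server-related causes.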


7 Case Studies 


In this section, we present two simple case studies to
illustrate the benefits of EtE monitor in assessing web
site performance. The first site is the HP Labs
external site (HPL Site), http://www.hpl.hp.com. Static web
pages comprise most of this site’s content. We measured
performance of this site for a month, from July 12, 2001
to August 11, 2001. The second site is a support site
for a popular HP product family, which we call Support
Site. It uses JavaServer Pages [11] technology for
dynamic page generation. The architecture of this site is
based on a geographically distributed web server cluster
with Cisco Distributed Director [5] for load balancing,
using “sticky connections”. We measured the site
performance for 2 weeks, from October 11, 2001 to October
25, 2001.

Table 4 summarizes the two sites’ performance
at-a-glance during the measured period using the two most
frequently accessed pages at each site. The average
end-to-end response time of client accesses to these pages
reflects good overall performance. However, in the case of
Figure 5: HPL site during a month: a) Number of all and aborted accesses to index.html; b) Approximated page size and
average access size to index.html.


Metrics                             | HPL url1 | HPL url2 | Support url1 | Support url2
EtE time                            | 3.5 sec  | 3.9 sec  | 2.6 sec      | 3.3 sec
% of accesses above 6 sec           | --       | --       | --           | --
% of aborted accesses above 6 sec   | --       | --       | --           | --
% of accesses from proxies          | 16.8%    | 19.8%    | 11.2%        | 11.7%
EtE time from proxies               | 4.2 sec  | 3 sec    | 4.5 sec      | 3 sec
Network-vs-Server ratio in EtE time | --       | --       | --           | --
Page size                           | --       | 60.9 KB  | 127 KB       | 100 KB
Server file hit ratio               | --       | --       | --           | --
Server byte hit ratio               | 44.5%    | 63.2%    | 52.8%        | 44.6%
Number of objects                   | 4        | 2        | 32           | 32
Number of connections               | --       | --       | 6.5          | --

Table 4: At-a-Glance statistics for www.hpl.hp.com and the
Support site during the measured period.




HPL, a sizeable percentage of accesses take more than
6 sec to complete (8.2%-8.3%), with a portion leading
to aborted accesses (1.3%-2.8%). The Support site had
better overall response time with a much smaller
percentage of accesses above 6 sec (1.8%-2.2%), and a
correspondingly smaller percentage of accesses aborted due
to high response time (0.1%-0.2%). Overall, the pages
from both sites are comparable in size. However, the
two pages from the HPL site have a small number of
objects per page (4 and 2, respectively), while the
Support site pages are composed of 32 different objects.
Page composition influences the number of client
connections required to retrieve the page content.
Additionally, statistics show that network and browser caches
help to deliver a significant amount of page objects: in
the case of the Support site, only 22.9%-28.6% of the
32 objects are retrieved from the server, accounting for
44.6%-52.8% of the bytes in the requested pages. As
discussed earlier, the Support site content is generated
using dynamic pages, which could potentially lead to a
higher ratio of server processing time in the overall
response time. But in general, the network transfer time
dominates the performance for both sites, ranging from
93.5% for the Support site to 99.7% for the HPL site.

Given the above summary, we now present more de- 
tailed information from our site measurements. For 
the HPL site, the two most popular pages during the 
observed period were index.html and a page in the 
news section describing the Itanium chip (we call it ita- 
nium.html). 

Figure 5 a) shows the number of page accesses to in- 
dex.html, as well as the number of aborted page accesses 
during the measured period. The graph clearly reflects 
weekly access patterns to the site. 

Figure 5 b) reflects the approximate page size, as
reconstructed by EtE monitor. We use this data to
additionally validate the page reconstruction process. While
debugging the tool, we manually compared the content
of the 20 most frequently accessed pages reconstructed
by EtE monitor against the actual web pages: the EtE
monitor page reconstruction accuracy for popular pages
is very high, practically 100%. Figure 5 b) allows us to
“see” the results of this reconstruction process over the
period of the study. In the beginning, it is a straight
line exactly coinciding with the actual page size. At
hour mark 153, it jumps and returns to a new straight-line
interval at the 175 hour mark. As we verified, the
page has been partially modified during this time inter- 
val. The EtE monitor “picked” both the old and the 
modified page images, since they both occurred during 
the same day interval and represented a significant frac- 
tion of accesses. However, the next day, the Knowledge 
Base was “renewed” and had only the modified page in- 
formation. The second “jump” of this line corresponds 
to the next modification of the page. The gap can be 
tightened, depending on the time interval EtE monitor 
is set to process. The other line in Figure 5 b) shows 
the average page access size, reflecting the server byte 
hit ratio of approximately 44%. 

To characterize the reasons leading to the aborted web
pages, we present an analysis of the aborted accesses to the
index.html page for 3 days in August (since the monthly
graph looks very “busy” on an hourly scale). Figure 6 a)
Figure 6: HPL site during 3 days: a) Number of all and aborted accesses to index.html; b) End-to-end response times for
accesses to index.html; c) CDF of all and aborted accesses to index.html sorted by the response time in increasing order.

Figure 7: HPL site during a month: a) end-to-end response times for accesses to index.html; b) number of resent packets in
response.


shows the number of all the requests and the aborted
requests to the index.html page during this interval. The
number of aborted accesses (662) accounts for 16.4% of
the total number of requests (4028).

Figure 6 b) shows the average end-to-end response
time measured by EtE monitor for index.html and the
average end-to-end response time for the aborted accesses
to index.html on an hourly scale. The end-to-end re-
sponse time for index.html page, averaged across all the 
page accesses, is 3.978 sec, while the average end-to-end 
response time of the aborted page accesses is 9.21 sec. 

Figure 6 c) shows a cumulative distribution of all 
accesses and aborted accesses to index.html sorted by 
the end-to-end response time in increasing order. The 
vertical line on the graph shows the threshold of 6 sec 
that corresponds to an acceptable end-to-end response 
time. Figure 6 c) shows that 68% of the aborted accesses 
demonstrate end-to-end response times below 6 sec. This 
means that only 32% of all the aborted accesses, which 
in turn account for 5% of all accesses to the page, ob- 
serve high end-to-end response time. The next step is to 
distinguish the reasons leading to a poor response time: 
whether it is due to network or server performance prob- 
lems, or both. For all the aborted pages with high re- 
sponse time, the network portion of the response time 
dominates the overall response time (98%-99% of the to- 
tal). Thus, we can conclude that any performance prob- 
lems are likely not server-related but rather due to con- 
gestion in the network (though it is unclear whether the 
congestion is at the edge or the core of the network). 
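As a quick consistency check on the figures quoted above, the share of all accesses that are both aborted and slow follows from multiplying the two percentages. This is a worked calculation using only the numbers in the text, not an additional measurement:

```latex
\frac{662}{4028} \approx 0.164,
\qquad
0.32 \times 0.164 \approx 0.052 \approx 5\%.
```

So roughly one access in twenty to index.html during these 3 days was aborted after exceeding the 6 sec threshold.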

Figure 7 a) shows the end-to-end response time for
accesses to index.html on an hourly scale during a month.
In spite of the good average response time reported in the
at-a-glance table, hourly averages reflect significant variation
in response times. This graph helps to stress the
advantages of EtE monitor and reflects the shortcomings
of active probing techniques that measure page
performance only a few times per hour: the collected test
numbers could vary significantly from a site’s instantaneous
performance characteristics.


Figure 7 b) shows the number of resent packets in the 
response stream to clients. There are three pronounced 
“humps” with an increased number of resent packets. 
Typically, resent packets reflect network congestion or 
the existence of some network-related bottlenecks. In- 
terestingly enough, such periods correspond to week- 
ends when the overall traffic is one order of magnitude 
lower than weekdays (as reflected in Figure 5 a)). The 
explanation for this phenomenon is that during week- 
ends the client population of the site “changes” signifi- 
cantly: most of the clients access the site from home us- 
ing modems or other low-bandwidth connections. This 
leads to a higher observed end-to-end response time and 
an increase in the number of resent packets (i.e., TCP is 
likely to cause drops more often when probing for the 
appropriate congestion window over a low-bandwidth 
link). These results again stress the unique capabilities
of EtE monitor to extract appropriate information
from network packets, and reflect another shortcoming 
of active probing techniques that use a fixed number of 
artificial clients with rather good network connections 
to the Internet. For site designers, it is important to 
Figure 8: HPL site during a month: a) number of all accesses to itanium.html; b) percentage of accesses with end-to-end
response time above 6 sec.

Figure 9: HPL site: a) server file hit ratio for itanium.html; b) server byte hit ratio for itanium.html.


understand the actual client population, their
end-to-end response time, and the “quality” of the response.
For instance, when a large population of clients has
limited bandwidth, the site designers should
consider making the pages and their objects “lighter
weight”.

Figure 8 a) shows the number of page accesses to ita- 
nium.html. When we started our measurement of the 
HPL site, the itanium.html page was the most popular 
page, “beating” the popularity of the main index.html 
page. However, ten days later, this news article started
to get “colder”, and the page dropped to seventh place
in popularity.

Figure 8 b) shows the percentage of accesses with end- 
to-end response time above 6 sec. The percentage of 
high response time jumps significantly when the page 
becomes “colder”. The reason behind this phenomenon 
is shown in Figure 9, which plots the server file hit and 
byte hit ratio. When the page became less popular, the 
number of objects and the corresponding bytes retrieved 
from the server increased significantly. This reflects that 
fewer network caches store the objects as the page be- 
comes less popular, forcing clients to retrieve them from 
the origin server. 


Figure 8 b) and Figure 9 explicitly demonstrate the
network caching impact on end-to-end response time.
When the caching efficiency of a page is higher (i.e.,
more page objects are cached by network and browser
caches), the response time measured by EtE monitor is
lower. Again, active probing techniques cannot measure
(or account for) the page caching efficiency to reflect the
“true” end-to-end response time observed by the actual
clients.


We now switch to the analysis of the Support site. We 
will only highlight some new observations specific to this 
site. Figure 10 a) shows the average end-to-end response 
time as measured by EtE monitor when downloading the 
site main page. This site uses JavaServer Pages
technology for dynamic generation of the content. Since
dynamic pages are typically more “compute intensive,” this
is reflected in a higher server-side processing
fraction of the overall response time. Figure 10 b)
shows the network-server time ratio in the overall
response time. It is higher compared to the network-server
ratio for static pages from the HPL site. One interest-
ing detail is that the response time spike around the 127 
hour mark has a corresponding spike in increased server 
processing time, indicating some server-side problems at 
this point. The combination of data provided by EtE 
monitor can help service providers to better understand 
site-related performance problems. 


The Support site pages are composed of a large
number of embedded images. The two most popular site pages,
which account for almost 50% of all the page accesses,
consist of 32 objects. The caching efficiency for the site
is very high: only 8-9 objects are typically retrieved from
the server, while the other objects are served from
network and browser caches. The site server is running an
Figure 10: Support site during 2 weeks: a) end-to-end response time for accesses to a main page; b) network-server time
ratio for the main page.

Figure 11: Support site during 2 weeks: a) connection setup time for the main page; b) an estimated percentage of end-to-end
response time improvement available from running an HTTP 1.1 server.

Figure 12: Support site: daily analysis of 20 ASes with largest client clusters: a) number of different clients accessing the
main page; b) corresponding end-to-end response time per AS.


HTTP 1.0 server. Thus typical clients used 7-9 connec- 
tions to retrieve 8-9 objects. The ConcurrencyCoef (see 
Section 6), which reflects the overlap portion of the la- 
tency between different connections for this page, was 
very low, around 1.038 (in fact, this is true for the site 
pages in general). This indicates that the efficiency of 
most of these connections is almost equal to sequential 
retrievals through a single persistent connection. 

Figure 11 a) shows the connection setup time
measured by EtE monitor. We perform a simple
computation: by how much could the end-to-end response time
observed by current clients be improved if the site ran
an HTTP 1.1 server, allowing clients to use just two
persistent connections to retrieve the corresponding
objects from the site? In other words, how much of the
response time can be saved by eliminating
unnecessary connection setup time?


Figure 11 b) shows the estimated percentage of
end-to-end response time improvement available from
running an HTTP 1.1 server. On average, during the
observed interval, the response time improvement for url1
is around 20% (from 2.6 sec to 2.1 sec), and for
url2 is around 32% (from 3.3 sec to 2.2 sec).

Figure 11 b) reveals an unexpected “gap” between 
230-240 hour marks, when there was “no improvement” 
due to HTTP 1.1. More careful analysis shows that
during this period, all the accesses retrieved only a basic
HTML page using 1 connection, without subsequent
image retrievals. The other pages during the same interval
have a similar pattern. It appears that the image directory
was not accessible on the server. Thus, EtE monitor,
by exposing the abnormal access patterns, can help ser- 
vice providers get additional insight into service related 
problems. 
EtE monitor also provides information about
client clustering by associating clients with various ASes
(Autonomous Systems). Figure 12 a) shows the 20
largest client clusters by ASes. Figure 12 b) reflects
the corresponding average end-to-end response time per
AS. The information provides a useful quantitative view
on response times to the major client clusters. It can
be used for efficient site design when a geographically
distributed web cluster is needed to improve site
performance. Similarly, such information can be used to
make appropriate decisions on specific content
distribution networks and wide-area replication strategies given
a particular service’s client population.

The ability of EtE monitor to report a site’s
performance for different ASes (and groups of IP addresses)
is a very attractive feature for service
providers. When service providers have special SLA
contracts with certain groups of customers, EtE monitor
provides a unique ability to measure the response time
observed by those clients and validate QoS for those
contracts.

Finally, we present a few performance numbers to
reflect the execution time of EtE monitor when processing
data for the HPL and Support sites. The tests are run on
a 550 MHz HP C3600 workstation with 512 MB of RAM.
Table 5 presents the amount of data and the execution
time for processing 10,000,000 TCP packets.


Duration, Size, and Exe- HPL site 
cution Time | , 
Duration of data collection 1 day 

= 8.94 GB 


9.6 MB 


157,200 
8.603 
B05 
12 min 44 sec | 17 min 41 sec 


Support site 


Collected data size 
Transaction Log size 35 MB 


Table 5: EtE monitor performance measurements. 


The performance of the reconstruction module
depends on the complexity of the web page
composition. For example, the Support site has a much
higher percentage of embedded objects per page than
the HPLabs pages. This “higher complexity” of the
reconstruction process is reflected by the higher EtE
monitor processing time for the Support site (17 min 41 sec)
compared to the processing time for the HPLabs site
(12 min 44 sec). The amount of incoming and outgoing
packets of a web server farm that an EtE monitor can
handle also depends on the rate at which tcpdump can
capture packets and the traffic of the web site.


8 Validation Experiments 


We performed two groups of experiments to validate the
accuracy of EtE monitor performance measurements and
its page access reconstruction power.

In the first experiment, we used two remote clients
residing at Duke University and Michigan State University
to issue a sequence of 40 requests to retrieve a designated
web page from the HPLabs external web site, which consists
of an HTML file and 7 embedded images. The total
page size is 175 Kbytes. To issue these requests, we used
httperf [16], a tool which measures the connection setup
time and the end-to-end time observed by the client for
a full page download. At the same time, an EtE monitor
measured the performance of the HPLabs external web site.
From the EtE monitor measurements, we filtered the
statistics about the designated client accesses. Additionally,
in EtE monitor, we computed the end-to-end time using
the two slightly different approaches discussed in
Section 6.1:


• EtE time (last byte): where the end of a transaction
is the time when the last byte of the response is sent
by a server;


• EtE time (ACK): where the end of a transaction
is the time when the ACK for the last byte of the
response is received.


Table 6 summarizes the results of this experiment (the 
measurements are given in sec): 


httperf                 | EtE monitor
Conn Setup | Resp. time | Conn Setup | EtE time (last byte) | EtE time (ACK)
0.953 


Client 


0.088_| 


Table 6: Experimental results validating the accuracy of 
EtE monitor performance measurements. 


The connection setup time reported by EtE monitor
is slightly higher (14-15 ms) than the actual setup time
measured by httperf, since it includes the time to not
only establish a TCP connection but also receive the
first byte of a request. The EtE time (ACK) coincides
with the actual measured response time observed by the
client. The EtE time (last byte) is slightly lower than
the actual response time by exactly a round trip delay
(the connection setup time measured by httperf
represents the round trip time for each client, accounting for
74-102 ms). These measurements correctly reflect our
expectations for EtE monitor accuracy (see Section 6.1).
Thus, we have some confidence that EtE monitor
accurately approximates the actual response time observed
by the client.

The second experiment was performed to evaluate the
reconstruction power of EtE monitor. The EtE monitor
with its two-pass heuristic method actively uses the
referer field to reconstruct the page composition and to
build a Knowledge Base about the web pages and
objects composing them. This information is used during
the second pass to more accurately group the requests
into page accesses. The question to answer is: how
dependent are the reconstruction results on the existence
of referer field information? If the referer field is not set
in most of the requests, how is the EtE monitor
reconstruction process affected? How is the reconstruction
process affected by accesses generated by proxies?

To answer these questions, we performed the follow- 
ing experiment. To reduce the incorrectness introduced 
by proxies, we first filtered the requests with via fields, 
which are issued by proxies, from the original Transac- 
tton Logs for the both sites. These requests constitute 
24% of total requests for the HPL site and 1.1% of total 
requests for the Support site. We call these logs filtered 
logs. Further, we mask the referer fields of all trans- 
actions in the filtered logs to study the correctness of 
reconstruction. We call these modified logs masked logs, 
which do not contain any referer fields. We notice that 
the requests with referer fields constitute 56% of the to- 
tal requests for the HPL site and 69% for the Support 
site in the filtered logs. Then, EtE monitor processes the 
filtered logs and masked logs. Table 7 summarizes the 
results of this experiment. 
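
The log preparation described above can be sketched as follows. This is only an illustration under stated assumptions: the `Transaction` record, its lower-cased header dictionary, and the helper names are hypothetical and are not EtE monitor's actual log format or code.

```python
from dataclasses import dataclass, replace
from typing import Dict, List


@dataclass(frozen=True)
class Transaction:
    """One HTTP transaction (hypothetical layout, for illustration only)."""
    url: str
    headers: Dict[str, str]  # request header names, assumed lower-cased


def filter_proxy_requests(log: List[Transaction]) -> List[Transaction]:
    """Build the 'filtered log': drop requests that carry a Via header."""
    return [t for t in log if "via" not in t.headers]


def mask_referers(log: List[Transaction]) -> List[Transaction]:
    """Build the 'masked log': same transactions, Referer header removed."""
    return [
        replace(t, headers={k: v for k, v in t.headers.items() if k != "referer"})
        for t in log
    ]


def referer_share(log: List[Transaction]) -> float:
    """Fraction of transactions in the log that still carry a Referer header."""
    if not log:
        return 0.0
    return sum("referer" in t.headers for t in log) / len(log)
```

Applying `filter_proxy_requests` and then `mask_referers` to the same input yields the two log variants compared in the experiment, and `referer_share` corresponds to the reported fraction of requests with referer fields.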


Metrics                                     | HPL url1 | HPL url2 | Support url1 | Support url2
Reconstructed page accesses (filtered logs) | 36,402   | 17,562   | 17,601       | 11,310
EtE time (filtered logs)                    | 3.3 sec  | 4.1 sec  | 2.4 sec      |
Reconstructed page accesses (masked logs)   | 33,735   | 14,727   | 15,401       | 8,890
EtE time (masked logs)                      | 3.2 sec  | 3.6 sec  |              |


Table 7: Experimental results validating the accuracy of EtE 
monitor reconstruction process for HPL and Support sites. 


The results of masked logs in Table 7 show that EtE 
monitor does a good job of page access reconstruction 
even when the requests do not have any referer fields. 
However, with the knowledge introduced by the ref- 
erer fields in the filtered logs, the number of recon- 
structed page accesses increases by 9-21% for the con- 
sidered URLs in Table 7. Additionally, we also find that 
the number of reconstructed accesses increases by 11.2- 
19.8% for all the considered URLs if EtE monitor pro- 
cesses the original logs without filtering either the via 
fields or the referer fields. The difference of EtE time 
between the two kinds of logs in Table 7 can be ex- 
plained by the difference of the number of reconstructed 
accesses. Intuitively, more reconstructed page accesses 
lead to higher accuracy of estimation. This observation 
also challenges the accuracy of active probing techniques 
considering their relatively small sampling sets. 


9 Limitations 


There are a number of limitations to our EtE monitor 
architecture. Since EtE monitor extracts HTTP transac- 
tions by reconstructing TCP connections from captured 
network packets, it is unable to obtain HTTP informa- 
tion from encrypted connections. Thus, EtE monitor is 
not appropriate for sites that encrypt much of their data 
(e.g., via SSL). 

In principle, EtE monitor must capture all traffic en- 
tering and exiting a particular site. Thus, our software 
must typically run on a single web server or a web server 
cluster with a single entry/exit point where EtE moni- 
tor can capture all traffic for this site. If the site “out- 
sources” most of its popular content to CDN-based so- 
lutions then EtE monitor can only provide the measure- 
ment information about the “rest” of the content, which 
is delivered from the original site. For sites using CDN- 
based solutions, the active probing or page instrumenta- 
tion techniques are more appropriate solutions to mea- 
sure the site performance. A similar limitation applies to 
pages with “mixed” content: if a portion of a page (e.g., 
an embedded image) is served from a remote site, then 
EtE monitor cannot identify this portion of the page 
and cannot provide corresponding measurements. In this 
case, EtE monitor consistently identifies the portion of 
the page that is stored at the local site, and provides 
the corresponding measurements and statistics. In many 
cases, such information is still useful for understanding 
the performance characteristics of the local site. 


The EtE monitor does not capture DNS lookup times. 
Only active probing techniques are capable of measuring 
this portion of the response times. Further, for clients 
behind proxies, EtE monitor can only measure the re- 
sponse times to the proxies instead of to the actual 
clients. 


As discussed in Section 3, the heuristic we use to 
reconstruct page content may determine incorrect page 
composition. Although the statistics of access patterns 
can filter invalid accesses, it works best when the sample 
size is large enough. 


Dynamically generated web pages introduce another 
issue with our statistical methods. In some cases, there 
is no consistent content template for a dynamic web page 
if each access consists of different embedded objects (for 
example, some pages use a rotated set of images or are 
personalized for client profiles). In this case, there is a 
danger that metrics such as the server file hit ratio and 
the server byte hit ratio introduced in Section 6 may be 
inaccurate. However, the end-to-end time will be com- 
puted correctly for such accesses. 


There is an additional problem (typical for server ac- 
cess log analysis of e-commerce sites) about how to ag- 
gregate and report the measurement results for dynamic 
sites where most page accesses are determined by URLs 
with client customized parameters. For example, an e- 
commerce site could add some client specific parameters 
to the end of a common URL path. Thus, each access 
to this logically same URL has a different URL expres- 
sion. However, service providers may be able to provide 
the policy to generate these URLs. With the help of the 
policy description, EtE monitor is still able to aggregate 
these URLs and measure server performance. 
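
As a hedged illustration of such a policy, the sketch below groups accesses by a canonical URL obtained by dropping client-specific query parameters. The parameter names in `CLIENT_SPECIFIC_PARAMS` and both function names are hypothetical; a real policy would come from the service provider.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical policy: query parameters that only identify the client.
CLIENT_SPECIFIC_PARAMS = frozenset({"sessionid", "userid"})


def canonical_url(url, client_params=CLIENT_SPECIFIC_PARAMS):
    """Strip client-specific query parameters so logically equal URLs match."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in client_params
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def aggregate_response_times(samples):
    """Map each canonical URL to the mean of its response times (seconds).

    `samples` is an iterable of (url, response_time_seconds) pairs.
    """
    grouped = defaultdict(list)
    for url, seconds in samples:
        grouped[canonical_url(url)].append(seconds)
    return {url: sum(times) / len(times) for url, times in grouped.items()}
```

With this policy, two accesses that differ only in `sessionid` are reported under a single URL.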


10 Conclusion and Future Work 


Today, understanding the performance characteristics of 
Internet services is critical to evolving and engineering 
Internet services to match changing demand levels, client 
populations, and global network characteristics. Exist- 
ing tools for evaluating web service performance typi- 
cally rely on active probing to a fixed set of URLs or on 
web page instrumentation that monitors download per- 
formance to a client and transmits a summary back to 
a server. This paper presents EtE monitor, a novel ap- 
proach to measuring web site performance. Our system 
passively collects packet traces from the server site to 
determine service performance characteristics. We in- 
troduce a two-pass heuristic method and a statistical 
filtering mechanism to accurately reconstruct the compo- 
sition of individual pages and performance characteristics 
integrated across all client accesses. 

Relative to existing approaches, EtE monitor offers 
the following benefits: i) a breakdown between the net- 
work and server overhead of retrieving a web page, ii) 
longitudinal information for all client accesses, not just 
the subset probed by a third party, iii) characteristics 
of accesses that are aborted by clients, and iv) quan- 
tification of the benefits of network and browser caches 
on server performance. Our initial implementation and 
performance analysis across two sample sites confirm the 
utility of our approach. We are currently investigat- 
ing the use of our tool to understand the client perfor- 
mance on a per-network region. This analysis can aid 
in the placement of wide-area replicas or in the choice 
of an appropriate content distribution network. Finally, 
our architecture generalizes to analyzing the performance 
of multi-tiered web services. For example, application- 
specific log processing can be used to reconstruct the 
breakdown of latency across tiers for communication be- 
tween a load balancing switch and a front end web server, 
or communication between a web server and the storage 
tier/database system. 


Abstract 


The growing popularity of thin-client systems makes it important to determine the factors that govern the 
performance of these thin-client architectures. To assess the viability of the thin-client computing model, we 
measured the performance of six popular thin-client platforms (Citrix MetaFrame, Microsoft Terminal Services, 
Sun Ray, Tarantella, VNC, and X) running over a wide range of network access bandwidths. We find that thin- 
client systems can perform well on web and multimedia applications in LAN environments, but the efficiency of the 
thin-client protocols varies widely. We analyze the differences in the various approaches and explain the impact of 
the underlying remote display protocols on overall performance. Our results quantify the impact of different 
approaches in display encoding primitives, display update policies, and display caching and compression techniques 
across a broad range of thin-client systems. 


1. Introduction 


In the last two decades, the centralized computing 
model of mainframe computing has shifted to the more 
distributed model of desktop computing. But as these 
personal desktop computers become ubiquitous in 
today's large corporate and academic organizations, the 
total cost of owning and maintaining them can become 
unmanageable. In response to this challenge, there is a 
growing movement to return to a more centralized and 
easier-to-manage computing strategy. The thin-client 
computing model is the embodiment of that movement. 

The goal of the thin-client model is to centralize 
computing resources, with all the attendant benefits of 
easier maintenance and cheaper upgrades, while 
maintaining the same quality of service for the end user 
that could be provided by a dedicated workstation. In a 
thin-client computing environment, end users move 
from full-featured computers to thin clients, lightweight 
machines primarily used for display and input and 
which require less maintenance and less frequent 
upgrades. Organizations then provide computing 
services to their end users’ thin clients from high- 
powered servers over a network connection. Server 
resources can be shared across many users, resulting in 
more effective utilization of computing hardware. 

While thin-client computing is reminiscent of the 
days of mainframe computing, today's users can no 
longer be satisfied by dumb terminals that only input 
and output ASCII text. Thin clients must be able to 
support graphical computing environments effectively 
to meet the users’ demands. The key mechanism for 
achieving this is a remote display protocol that enables 
graphical displays to be served across a network to a 
client device, while all application logic is executed on 
the server. Using such a protocol, the client transmits 
user input to the server, and the server returns screen 
updates to the client. For some thin-client systems, no 
unrecoverable state is stored on the client at all. 

Because of the potential cost benefits of thin- 
client computing, a wide range of thin-client platforms 
have been developed. Some are designed specifically 
for use over high-bandwidth local area networks, while 
others attempt to provide quality service over slow 
network connections. Some application service 
providers (ASPs) are even offering thin-client service 
over wide area networks such as the Internet [3, 21]. 
The growing popularity of thin-client systems makes it 
important to analyze their performance, to assess the 
general feasibility of the thin-client computing model, 
and to compare various thin-client platforms and 
determine the factors that govern their performance. 
However, while many thin-client platforms and 
protocols have been developed, most of these systems 
and their protocols are proprietary, and few of the 
vendors have provided detailed performance 
measurements for their own products or a cross- 
platform analysis against other vendors' products. 

To assess the viability of the thin-client computing 
model, we have measured the performance of thin- 
client computing platforms running over a wide range 
of network access bandwidths. We have characterized 
the design choices of underlying remote display 
technologies and quantified the performance impact of 
these choices. We considered a range of design choices 
as exhibited by six of the most popular thin-client 
platforms in use today: Citrix MetaFrame [5, 14], 
Microsoft Windows 2000 Terminal Services [6], AT&T 
Virtual Network Computing (VNC) [22, 32], Tarantella 
[24, 27], Sun Ray [26, 30], and X [25]. These platforms 
were chosen for their popularity, performance, and 
diverse design approaches. 


Platform | Display Encoding | Screen Updates | Compression | Client Caching | Client Cache Size | Max Client Display | Transport Protocol
Citrix MetaFrame (ICA) | Low-level graphics | Server-push, lazy | RLE | Glyphs, small bitmaps in memory; large bitmaps on disk | 3 MB RAM, percent of disk (1% default) | 8-bit color* | TCP/IP
Microsoft Terminal Services (RDP) | Low-level graphics | Server-push, lazy | RLE | Glyphs, small bitmaps in memory; large bitmaps on disk | 1.5 MB RAM, 10 MB disk | 8-bit color | TCP/IP
Tarantella (AIP) | Low-level graphics | Server-push, eager or lazy depending on bandwidth, load | Adaptively enabled, RLE and LZW at low bandwidths | Glyphs, pixmaps, files | 1024 objects | |
AT&T VNC | 2D draw primitives | Client-pull, lazy; updates between client requests discarded | Hextile (2D RLE) | Only local framebuffer (Copyrect) | | 24-bit color | TCP/IP
Sun Ray | 2D draw primitives | Server-push, eager | None | Only local framebuffer (Copyrect) | N/A | 24-bit color | UDP/IP
X | High-level graphics | Server-push, eager | None | Application / toolkit-specific, usually none | N/A | 24-bit color | TCP/IP

* Citrix MetaFrame XP offers the option of 24-bit color depth, but this was not available in time for our experiments. 


Table 1: Characteristics of thin-client platforms. 


We report the first quantitative measurements to 
examine the performance of such a broad range of thin- 
client architectures in various network environments. 
Because many thin-client systems are closed-source and 
proprietary, we employed slow-motion benchmarking 
[37], a novel non-intrusive measurement technique that 
addresses some of the fundamental difficulties in 
previous studies of thin-client performance. Our results 
show that thin-client computing can deliver good 
performance for web and multimedia applications, but 
performance varies widely among different thin-client 
platform designs. In particular, a simple pixel- 
based remote display approach can deliver superior 
performance to more complex thin-client systems that 
are currently popular. We analyze the differences in the 
underlying mechanisms of various thin-client platforms 
and explain their impact on overall performance. 


This paper is organized as follows. Section 2 
details the experimental testbed and methodology we 
used for our study. Section 3 describes our 
measurements and performance results. Section 4 
discusses some related work. Finally, we present some 
concluding remarks and directions for future work. 


General Track: 2002 USENIX Annual Technical Conference 


2. Experimental Design 


The goal of our research was to compare thin- 
client systems to assess their basic display performance 
in various network environments. In our experiments, 
we used the following six versions of thin-client 
platforms: Citrix MetaFrame 1.8 for Windows 2000, 
Windows 2000 Terminal Services, Tarantella 
Enterprise Express II for Linux, AT&T VNC v3.3.2 for 
Linux, Sun Ray I for Solaris, and XFree86 3.3.6 on 
Linux. In this paper, we also refer to these platforms by 
their remote display protocols, which are Citrix ICA 
(Independent Computing Architecture), Microsoft RDP 
(Remote Desktop Protocol), Tarantella AIP (Adaptive 
Internet Protocol), VNC, Sun Ray, and X, respectively. 
As summarized in Table 1, these platforms span a range 
of differences in the encoding of display primitives, 
policies for updating the client display, algorithms for 
compressing screen updates, supported display color 
depth, and transport protocol used. To evaluate their 
performance, we designed an experimental testbed and 
various experiments to exercise each of the thin-client 
platforms on single-user web-based and multimedia- 
oriented workloads using slow-motion benchmarking as 
explained in Section 2.1. Section 2.2 describes the 
experimental testbed we used. Section 2.3 discusses the 
application benchmarks used in our experiments. 


2.1 Measurement Methodology 


To provide a more effective method for evaluating 
thin-client performance, we previously developed slow- 
motion benchmarking [37]. We developed this 
benchmarking technique in order to address the 
inadequacies in conventional benchmarks in measuring 
thin-client performance. In thin-client systems, the 
client display is often decoupled from the server-side 
application execution. In some systems, the screen 
updates may be merged or even discarded in order to 
synchronize the display with the application logic. 
While these techniques allow the thin server to run the 
application without being constrained by the slow 
display update speed, they pose a unique challenge in 
benchmarking. Standard benchmarks designed for 
desktop systems cannot be used to provide accurate 
results when evaluating thin-client systems. Because 
the benchmark applications are executed on the thin 
server, independent of the client-side display updates, 
the benchmarks effectively only measure the server’s 
performance and do not accurately reflect the user’s 
experience at the client-side. A video playback 
benchmark, for example, would measure the frame rate 
as rendered on the server, but if many of the frames did 
not reach the client, the frame rate reported by the 
benchmark would give an exaggerated view of the 
system’s performance. While internal instrumentation 
may be an effective solution to this problem, many thin- 
client products are proprietary and closed-source, 
making it difficult to instrument them and obtain 
accurate results. Internal instrumentation can also add 
intrusive processing overhead. 

In slow-motion benchmarking, we use network 
packet traces to monitor the latency and data transferred 
between the client and the server, but we alter the 
benchmark application by inserting delays between the 
separate visual events, such as web pages or video 
frames, so that the display update for each event is fully 
completed on the client before the server begins 
processing the next one. Then we process the network 
packet traces and use these gaps of idle time between 
events to break up the results on a per-event basis. This 
allows us to obtain the latency and data transferred for 
each visual event separately. We can then obtain overall 
results by taking the sum of these per-event results. The 
amount of the delay inserted depends on the application 
workload and platform being tested. The necessary 
length of delay can be determined by monitoring the 
network traffic and making the delays long enough to 
achieve a clearly demarcated period between all the 
visual events where client-server communication drops 
to the idle level. This ensures that each visual event is 
discrete and generated completely. 
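
The per-event post-processing step can be sketched roughly as follows. This is a minimal illustration, not the authors' tooling: it assumes the trace has already been reduced to time-ordered `(timestamp, size)` pairs with idle-level background traffic removed, and the names `EventResult`, `split_by_idle_gaps`, and `idle_gap` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class EventResult:
    latency: float          # seconds from first to last packet of the event
    bytes_transferred: int  # total bytes observed for the event


def split_by_idle_gaps(
    packets: Sequence[Tuple[float, int]], idle_gap: float
) -> List[EventResult]:
    """Split a packet trace into visual events separated by idle periods.

    `packets` holds (timestamp_seconds, size_bytes) pairs sorted by time.
    A silence of at least `idle_gap` seconds closes the current event.
    """
    events: List[EventResult] = []
    start = prev = None
    total = 0
    for timestamp, size in packets:
        if prev is not None and timestamp - prev >= idle_gap:
            # The inserted delay was reached: finish the current event.
            events.append(EventResult(prev - start, total))
            start, total = timestamp, 0
        if start is None:
            start = timestamp
        total += size
        prev = timestamp
    if prev is not None:
        events.append(EventResult(prev - start, total))
    return events
```

Overall results are then the sums of the per-event values, for example `sum(e.latency for e in events)` and `sum(e.bytes_transferred for e in events)`.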


2.2 Experimental Testbed 


To verify our results in a controlled network 
environment and to provide a basis for comparison, we 
constructed an isolated network testbed. Our 
experimental testbed consisted of seven machines, five 
of which were active for any given test. The testbed 
consisted of a network emulator machine, a packet 
monitor machine, two pairs of thin client/server 
systems, and a web server used for the web benchmark. 


The network emulator machine was a Micron Client 
Pro PC with two 10/100BaseT NICs running The Cloud 
[29], a network emulator that we used to adjust the 
network bandwidth between the client and server. For 
our experiments, we considered the performance of 
thin-client systems over a range of network bandwidths, 
specifically 128 Kbps, 768 Kbps, 1.5 Mbps, 10 Mbps, 
and 100 Mbps, corresponding roughly to ISDN, DSL, 
T1, 10BaseT, and 100BaseT, respectively. The packet 
monitor machine was a Micron Client Pro PC running 
Etherpeek 4 [33], a network traffic monitor that we 
used to obtain the measurements for slow-motion 
benchmarking. To ensure a level playing field, we used 
the same client/server hardware for all of our tests 
except when testing the Sun Ray platform, which only 
runs on Sun machines. The features of each system are 
summarized in Table 2. As discussed in Section 3, the 
slower Sun client and server hardware did not affect the 
lessons derived from our experiments. 

Unless otherwise stated, the video resolution of 
the client was set to 1024x768 with 8-bit color, as this 
was the lowest common denominator supported by all 
of the platforms. However, the Sun Ray client was set 
to 24-bit color, since the Sun Ray display protocol is 
based on a 24-bit color encoding. By default, 
compression and memory caching were left on for those 
platforms that used it, and disk caching was turned off 
by default in those platforms that supported it. For each 
thin-client system, we used the server operating system 
that delivered the best performance for the given 
system: Terminal Services runs only on Windows; 
MetaFrame ran best on Windows; Tarantella, VNC, and 
X ran best on UNIX/Linux; and Sun Ray runs only on 
Solaris. 


2.3 Application Benchmarks 


To measure the performance of the thin-client 
platforms, we used two application benchmarks: a web 
benchmark for measuring web browsing performance, 
and a video benchmark for measuring video playback 
performance. The web and video benchmarks were 
used with the slow-motion benchmarking technique 
mentioned in Section 2.1 to measure thin-client 
performance effectively. We describe each of these 
benchmarks below. 


2.3.1 Web Benchmark 


The web benchmark we used was based on the 
Web Text Page Load test from the Ziff-Davis i-Bench 
benchmark suite [10]. We first describe the original i- 
Bench web benchmark and then discuss how it was 
modified for our experiments. The original i-Bench web 
benchmark loads a JavaScript-controlled sequence of 
54 web pages from the web benchmark server. 
Normally, as each page downloads, a small script 
contained in each page starts off the subsequent 
download. The pages contain both text and bitmap 
images, with some pages containing more text while 
others contain more images. Some common elements 
appear on each page, including a blue left column, a 
white background, a PC Magazine logo and other small 
images. The JavaScript cycles through the page loads 
twice, resulting in a total of 108 web pages being 
downloaded during this test. When the benchmark is 
run from a thin client, the thin server executes the 
JavaScript that sequentially requests the test pages from 
the i-Bench server and relays the display information to 
the thin client. For the web benchmark used in our tests, 
we modified the original i-Bench benchmark’s 
JavaScript to introduce delays of several seconds 
between pages, sufficient in each 
case to ensure that the thin client received and 
displayed each page completely and that there was no 
temporal overlap in transferring the data belonging to 
two consecutive pages. We used the packet monitor to 
record the packet traffic for each page, and then used 
the timestamps of the first and last packet associated 
with each page to determine the download time for each 
page. 


Role / Model | Hardware | OS / Window System | Software
PC Thin Client: Micron Client Pro | 450 MHz Intel PII, 128 MB RAM, 14.6 GB disk, 10/100BaseT NIC | MS Win 2000 Professional; Caldera OpenLinux 2.4, XFree86 3.3.6, KDE 1.1.2 | Citrix ICA Win32 Client; MS RDP5 Client; VNC Win32 3.3.3r7 Client; SCO Tarantella Win32 Client; Netscape Communicator 4.72
Sun Thin Client: Sun Ray I | 100 MHz Sun uSPARC IIep, 8 MB RAM, 10/100BaseT NIC | Sun Ray OS |
Packet Monitor: Micron Client Pro | 450 MHz Intel PII, 128 MB RAM, 14.6 GB disk, 10/100BaseT NIC | MS Win 2000 Professional | AG Group's Etherpeek 4
Web Benchmark Server: Micron Client Pro | 450 MHz Intel PII, 128 MB RAM, 14.6 GB disk, 10/100BaseT NIC | MS Win NT 4.0 Server SP6a | Ziff-Davis i-Bench 1.5; MS Internet Information Server
PC Thin-Client Server: Micron Client Pro (SPEC95: 17.2 int, 12.9 fp) | 450 MHz Intel PII, 128 MB RAM, 14.6 GB disk, 2 10/100BaseT NICs | MS Win 2000 Advanced Server; Caldera OpenLinux 2.4, XFree86 3.3.6, KDE 1.1.2 | Citrix MetaFrame 1.8; MS Win 2000 Terminal Services; AT&T VNC 3.3.3r7 for Win32; SCO Tarantella Express; AT&T VNC 3.3.3r2 for Linux; Netscape Communicator 4.72
Sun Thin-Client Server: Sun Ultra-10 Creator 3D (SPEC95: 14.2 int, 16.9 fp) | 333 MHz UltraSPARC IIi, 384 MB RAM, 9 GB disk, 2 10/100BaseT NICs | Sun Solaris 7 Generic 106541-08, OpenWindows 3.6.1, CDE 1.3.5 | Sun Ray Server 1.2_10.d Beta; Netscape Communicator 4.72
Network Simulator: Micron Client Pro | 450 MHz Intel PII, 128 MB RAM, 14.6 GB disk, 2 10/100BaseT NICs | MS Win NT 4.0 Server SP6a | Shunra Software The Cloud 1.1


Table 2: Testbed machine configurations. 


We used Netscape Navigator 4.72 as the web 
client for the web benchmark, as it is available on all 
the platforms in question. The browser's memory cache 
and disk cache were enabled but cleared before each 
test run. In all cases, the Netscape browser window was 
1024x768 in size, so the region being updated was the 
same on each system. 


2.3.2 Video Benchmark 


The video benchmark program processes and 
displays an MPEG1 video file containing a mix of news 
and entertainment programming. We measured video 
performance by monitoring resulting packet traffic at 
two playback rates, 1 frame/second (fps) and 24 fps. 
Although no user would want to play video at 1 fps, we 
took the measurement at that frame rate in order to 
establish the reference data size transferred from the 
thin server to the client that corresponds to a "perfect" 
playback. To measure the normal 24 fps playback 
performance and video quality, we monitored the 
packet traffic delivered to the thin client at this 
playback rate and compared the total data transferred to 
the reference data size. The video quality can then be 
quantified by the ratio of data transfer rate at the full 
frame rate of 24 fps to the transfer rate at the slow- 
motion playback rate of 1 fps expressed in percent [37]. 
The ratio was computed as follows: 

\[
\mathrm{VQ} =
\frac{\dfrac{\mathrm{DataTransferred}(24\,\mathrm{fps}) \,/\, \mathrm{PlaybackTime}(24\,\mathrm{fps})}{\mathrm{IdealFPS}(24\,\mathrm{fps})}}
     {\dfrac{\mathrm{DataTransferred}(1\,\mathrm{fps}) \,/\, \mathrm{PlaybackTime}(1\,\mathrm{fps})}{\mathrm{IdealFPS}(1\,\mathrm{fps})}}
\]
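
The video quality ratio can be computed with a few lines of code. The sketch below is illustrative only; the function and parameter names are hypothetical, and the result is scaled to percent because the text states that the ratio is expressed in percent.

```python
def video_quality(
    data_full: float,
    time_full: float,
    data_slow: float,
    time_slow: float,
    ideal_fps_full: float = 24.0,
    ideal_fps_slow: float = 1.0,
) -> float:
    """Return video quality in percent.

    data_*  : data transferred during playback (any unit, same for both runs)
    time_*  : measured playback time in seconds
    *_full  : run at the full frame rate (24 fps in the paper)
    *_slow  : slow-motion reference run (1 fps in the paper)
    """
    # Data delivered per ideal frame at each playback rate.
    per_frame_full = data_full / time_full / ideal_fps_full
    per_frame_slow = data_slow / time_slow / ideal_fps_slow
    return 100.0 * per_frame_full / per_frame_slow
```

For an 834-frame clip, a platform that transfers the same amount of data in 34.75 s at 24 fps as it does in 834 s at 1 fps scores about 100%, while one that delivers only half of that data in the same playback time scores about 50%.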


For the video benchmark, we used two different 
MPEG1 players. We used Microsoft Windows Media 
Player version 6.4.09.1109 for the Windows-based thin 
clients and MpegTV version 1.1 for the Linux/Solaris- 
based platforms. Both players were used with non- 
video portions of the interfaces minimized so that the 
appearance of the playback application was similar 
across all platforms. In the minimized mode, accessory 
components like progress bars, frame counters, or 
clocks were not displayed. The test video clip was 
34.75 seconds long and consisted of 834 frames of 
352x240 pixels with an ideal frame rate of 24 fps. The total 
video file size was 5.11 MB. The thin server executed 
the video playback program to decode the MPEG1 
video then relayed the resulting display to the client. 


3. Experimental Results 


We ran the web and video benchmarks on each of 
the six thin-client platforms and measured their 
resulting performance under five network bandwidths. 
The web benchmark results are shown both in terms of 
latencies and the respective amounts of data transferred 
from server to client to illustrate both the overall user- 
perceived performance and the bandwidth efficiency of 
the thin-client systems. The data transferred from client 
to server was not significant in any of our experiments. 

Section 3.1 discusses the results obtained for 
running the thin-client systems with their default 
configuration options as discussed in Section 2.2. 
Section 3.2 analyzes the impact of the underlying 
baseline remote display encodings. Section 3.3 
considers the impact of caching and compression 
mechanisms on thin-client performance. 


3.1 Default Configurations 


The results of running the web benchmark on each 
of the thin-client systems with the default settings are 
shown in Figure 1 through Figure 4. The results of 
running the video benchmark on each of the thin-client 
systems are shown in Figure 5 through Figure 8. For 
comparison purposes, we also show results for using 
the PC client connected directly through the network 
emulator to the web and video server to demonstrate the 
performance of a traditional “fat” client system for web 
browsing and streaming video, respectively. 


3.1.1 Web Performance 


Figure 1 shows the average download latency per 
page. Usability studies have shown that web pages 
should take less than one second to download for the 
user to enjoy an uninterrupted browsing experience [16, 
17]. Using this metric, all of the thin-client systems 
delivered good performance over the 10 Mbps and 100 
Mbps LAN bandwidths with average web page 
latencies well under a second. Using the 100 Mbps 
bandwidth, X and AIP are the fastest with average web 
page latencies of less than 300 ms while the other thin- 
client systems have average latencies of about 500 ms. 
Figure 1 shows that reducing the bandwidth had the 
biggest negative impact on X and Sun Ray. In contrast, 
Citrix ICA, Microsoft RDP, Tarantella AIP, and VNC 
were able to deliver sub-second average web page 
latencies over bandwidths as low as 768 Kbps, 
corresponding to DSL environments. However, none of 
the thin-client systems were able to deliver sub-second 
performance at 128 Kbps. Only the PC fat-client 
achieved sub-second performance across all bandwidths 
tested. The results indicate that thin-client systems can 
provide good web browsing performance in broadband 
or higher bandwidth network environments, but are not 
yet able to perform well in lower-bandwidth dialup 
modem and ISDN environments. 

The web performance of the systems at various 
bandwidths can be better understood by examining the 
average amount of data sent per web page shown in 
Figure 2. Since the visual quality is constant across all 
bandwidths as a result of slow-motion benchmarking, 
the amount of data transferred for each platform is also 
essentially constant across all bandwidths, except for 
AIP. For AIP, the different data transfer amounts across 
various bandwidths are caused by adaptive compression 
mechanisms which we discuss further in Section 3.3. 

At higher bandwidths, there is little correlation 
between the amount of data transferred and the average 
web page latency. The best performing thin-client 
systems at the LAN bandwidths were X and AIP, which 
sent far more data than the lesser performing ICA and 
VNC. X sent more data than any other thin-client 
system except Sun Ray at 100 Mbps, yet it achieved the 
best performance at this bandwidth. At lower 
bandwidths, however, there is a direct correlation 
between the amount of data transferred and the average 
web page latency. ICA sends the least amount of data 
and has the best performance of all the thin-client 
systems when using the 128 Kbps network 
environment. As shown in Figure 2, ICA sends on 
average about 30 KB of data per page, only twice as 
much data for its display updates compared to using 
HTTP with a PC fat-client. 
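
A back-of-the-envelope calculation helps relate these data sizes to the one-second latency threshold. The sketch below is an illustration under simplifying assumptions that are not from the paper: 1 KB is taken as 1000 bytes, 1 Kbps as 1000 bits per second, and protocol overhead and round-trip delays are ignored, so the result is only a lower bound. The function name is hypothetical.

```python
def min_transfer_seconds(kilobytes: float, link_kbps: float) -> float:
    """Lower bound on the time needed to move `kilobytes` over a link.

    Assumes 1 KB = 1000 bytes and 1 Kbps = 1000 bit/s, and ignores
    protocol overhead and latency.
    """
    bits = kilobytes * 1000 * 8
    return bits / (link_kbps * 1000)


# About 30 KB per page (the ICA average) over the two slowest links tested.
isdn_seconds = min_transfer_seconds(30, 128)  # 1.875
dsl_seconds = min_transfer_seconds(30, 768)   # 0.3125
```

Under these assumptions, even the most bandwidth-efficient platform needs nearly two seconds per page at 128 Kbps, which is consistent with no thin-client system achieving sub-second latency at that bandwidth, while the same page fits well within a second at 768 Kbps.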

Figure 3 and Figure 4 show the network 
bandwidth and client and server CPU utilizations for 
the web benchmark. The utilization measurements 
shown do not include the idle time between web pages. 
Figure 3 shows that the stronger correlation between 
latency and data transfer efficiency at lower bandwidths 
is due to the network becoming the main bottleneck. 
When the average bandwidth utilization exceeds 85 
percent, the latency incurred for the thin-client systems 
generally increases beyond the one-second web page 
latency threshold. Figure 4 shows the client and server 
load when using the 100 Mbps network environment. 
The measurements show that, except for VNC, the 
clients were not heavily loaded during the web 
benchmark, indicating that the client CPU was not the 
primary bottleneck even at high bandwidths. In the case 
of VNC, the client does not rest much as it is constantly 
pulling from the server. The CPU utilization for the Sun 
Ray hardware client is not shown because there were no 
tools available to measure it. In general, the server CPU 
was more heavily loaded than the client CPU. AIP, 
which requires running a web server on the server, had 
the highest server CPU utilization and appears limited 
by server speed in a 100 Mbps network environment. 

Figure 1: Average latency per page in the web benchmark 
with default settings at various network bandwidths. 

Figure 3: Average bandwidth utilization while downloading 
pages in the web benchmark with default settings at various 
network bandwidths. 


3.1.2 Video Performance 


Figure 5 shows the resulting video quality on each 
system for various network bandwidth environments. 
The video quality was quantified using the VQ formula 
discussed in Section 2.3.2. Unlike the web benchmark 
performance, several of the thin-client platforms, ICA, 
RDP, and VNC, deliver poor video quality even in the 
100 Mbps network environment. Only X, AIP, and Sun 
Ray deliver good video quality at the highest 
bandwidth. None of the platforms deliver reasonable 
video quality at lower network bandwidths. Figure 5 
shows that X, AIP, and Sun Ray all deliver over 90 
percent video quality at 100 Mbps, but that even the 
best of them degrades to only about 50 percent video 
quality at 10 Mbps. Sun Ray has a special color space 
convert display primitive that can be used to improve 
the video playback performance if the application is 
written to exploit the feature. The MpegTV application 
we used, however, was not written to do so. No video 
benchmark data is shown for Sun Ray at 128 Kbps, 
because Sun Ray could not play the entire clip without 
interruption due to the limited bandwidth. The PC fat- 
client provides good video quality even at 1.5 Mbps, 
but the video quality rapidly deteriorates at lower 
bandwidths. For all platforms, the video playback time 
was relatively constant across all bandwidths, taking 
about 35 seconds to play the entire video clip. 

Figure 2: Average data transferred per page in the web 
benchmark with default settings at various network 
bandwidths. 

[Figure 4 plot: CPU utilization of client and server for each platform.] 

Figure 4: Average client and server CPU utilization while 
downloading pages in the web benchmark with default 
settings at 100 Mbps. 

Figure 5: Video quality in the video benchmark with default 
settings at various network bandwidths. 


The video performance of the various systems at 
different bandwidths can be better understood by 
examining the total data transferred. Figure 6 shows the 
amount of data transferred by each system at the normal 
playback rate of 24 fps and at the slow-motion playback 
rate of 1 fps. The 1 fps data transfer measurements 
show how efficiently each system encoded the display 
updates when all of the video frames were fully 
delivered and displayed on the client. Comparing the 24 
fps and 1 fps measurements, we see that all of the 
systems discard data at lower bandwidths to maintain a 
constant playback rate, resulting in lower video quality. 
ICA, RDP, and VNC even discard large amounts of 
data at 100 Mbps. Figure 6 also shows that the thin- 
client systems that performed the best on the video 
benchmark were also the least data efficient at encoding 
the display. AIP and X transferred roughly 70 MB to 
play back the video clip in 8-bit color and Sun Ray 
transferred roughly three times that amount to display 
in 24-bit color. These data transfer rates are comparable 
to sending raw pixels over the network for each 
352x240 pixel frame and more than ten times the 
transfer rate of MPEG streaming the 5.11 MB clip. 
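The comparison with raw pixels can be checked with simple arithmetic. The sketch below uses the frame size, frame rate, and clip size given in the text, and assumes a playback time of 35 seconds (the approximate figure reported above) and 1 MB = 10^6 bytes. The function name is chosen here for illustration.

```python
WIDTH, HEIGHT = 352, 240   # frame size from the text
FPS = 24                   # normal playback rate from the text
CLIP_SECONDS = 35          # approximate playback time reported in Section 3.1.2
MPEG_CLIP_MB = 5.11        # size of the MPEG clip from the text


def raw_video_mb(bits_per_pixel: int) -> float:
    """Size in MB (10**6 bytes) of sending every frame as uncompressed pixels."""
    bytes_per_frame = WIDTH * HEIGHT * bits_per_pixel // 8
    frames = FPS * CLIP_SECONDS
    return bytes_per_frame * frames / 1e6


raw_8bit = raw_video_mb(8)
raw_24bit = raw_video_mb(24)

print(f"raw 8-bit video:  {raw_8bit:.1f} MB")
print(f"raw 24-bit video: {raw_24bit:.1f} MB")
print(f"8-bit raw vs. MPEG clip: {raw_8bit / MPEG_CLIP_MB:.1f}x")
```

With these inputs, uncompressed 8-bit video comes to about 71 MB and 24-bit video to three times that, which matches the roughly 70 MB reported for AIP and X and the threefold figure for Sun Ray.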

Figure 6: Total data transferred in full-motion (24 fps) and 
slow-motion (1 fps) playback in the video benchmark with 
default settings at various network bandwidths. 

Figure 7: Average bandwidth utilization during video 
playback in the video benchmark with default settings at 
various network bandwidths. 


Comparing Figure 7 and Figure 3 shows that the 
average bandwidth consumption of the thin-client 
systems when running the video benchmark was much 
higher than when running the web benchmark. None of 
the platforms was bandwidth limited at 100 Mbps, even 
though half of the systems (ICA, RDP, and VNC) 
delivered poor video quality at that bandwidth. 
However, all of the three systems (X, AIP, and Sun 
Ray) that delivered good video quality at 100 Mbps 
consumed well over 10 Mbps of network bandwidth. 
As a result, bandwidth limitations were the primary 
bottleneck for these three systems at lower network 
bandwidths. For the other systems that failed to perform 
well even at 100 Mbps, Figure 8 indicates that none of 
the client or server systems had high CPU load except 
for VNC. We note that while none of the average client 
and server utilization measurements reached 100 
percent, the system loads varied widely, with frequent 
peaks at 100 percent for VNC on the client side, 
suggesting that VNC video performance is limited by 
the client’s CPU speed. 

Our measurements of thin-client performance on 
the web and video benchmarks indicate that AIP, X, 
and Sun Ray are better able to support a broad range 
of applications, particularly multimedia applications. 
The results also suggest that thin-client systems such as 
ICA, RDP, and VNC can be quite bandwidth efficient 
for web applications, but that the same mechanisms 
that lead to bandwidth efficiency may degrade 
performance in multimedia video applications. 

Figure 8: Average client and server CPU utilization during 
video playback in the video benchmark with default settings 
at 100 Mbps. 


3.2 Baseline Display Encoding Primitives 


To understand how the underlying design choices 
in thin-client systems impact their performance, we 
isolated the effects that can be attributed to the basic 
display encoding primitives used. Four types of display 
encoding primitives are high-level graphics, low-level 
graphics, 2D draw primitives, and raw pixels. Higher- 
level display encodings are generally considered to be 
more bandwidth efficient, but may require more 
computational complexity on the client and may be less 
platform-independent. For instance, graphics primitives 
such as fonts require the thin-client system to separate 
fonts from images, whereas pixel primitives enable 
the system to view all updates as just regions of pixels 
without any semantic knowledge of the display content. 
X takes a high-level graphics encoding approach and 
supports a rich set of graphics primitives in its protocol. 
ICA, RDP, and AIP are based on lower-level graphics 
primitives that include support for fonts, icons, drawing 
commands as well as images. Sun Ray and VNC 
employ 2D draw primitives such as fills for filling a 
screen region with a single color or a two-color bitmap 
for common text-based windows. VNC can also be 
configured to use raw pixel encoding only, but none of 
the systems we considered used raw pixels by default. 

To examine the basic display encoding 
performance, we disabled all configurable caching and 
compression mechanisms and ran the benchmarks. For 
AIP, there was no option to disable caching. For VNC, 
the display compression could not be disabled because 
it is built into the default hextile display encoding used. 
For X and Sun Ray, the baseline and default 
configurations were the same as there were no caching 
and compression options. For comparison purposes, we 
also show measurements using the VNC raw pixel 
encoding (RAW), which essentially encodes display 
updates as just raw pixels. The caching and 
compression options for each platform are discussed in 
further detail in Section 3.3. Due to space constraints, 
and since performance at lower network bandwidths is 
strongly correlated with bandwidth efficiency, we 
simply present latency and data transfer measurements 
for experiments at 100 Mbps to illustrate the baseline 
display encoding performance for the various 
approaches. 

Figure 9: Average latency per page in the web benchmark 
with baseline settings at 100 Mbps. 


3.2.1 Web Performance 


Figure 9 and Figure 10 show the latency and data 
transfer measurements for the baseline performance of 
the thin-client systems running the web benchmark. In 
particular, we show results for running two versions of 
the web benchmark: one with all of the images 
displayed normally, and one with just text in which all 
of the images were removed and replaced with blank 
spaces of equal size. We employed both versions to 
compare how different thin-client mechanisms perform 
on graphics versus text-oriented media. 

We first discuss the baseline measurements with 
the standard benchmark content (both images and text). 
Figure 9 shows that the average web page download 
latencies are not much different than those with the 
default thin-client configurations discussed in Section 
3.1.1. We note that all of the systems fare much better 
than RAW, which results in unacceptable average web 
page latencies of over 4 seconds. ICA and RDP exhibit 
somewhat higher latencies using just the baseline 
display encoding primitives as opposed to the default 
configurations. X and AIP still deliver the lowest 
average web page download latencies. 

The more interesting measurements are in Figure 
10, which shows the average data transferred per web 
page at the baseline settings. Comparing with RAW, the 
results show that all of the other display encodings used 
are substantially more bandwidth efficient than sending 
raw pixels, in some cases by more than an order of 
magnitude. ICA, RDP, and AIP all send about the same 
amount of data, which is consistent with the fact that 
they all employ low-level graphics display encoding 
primitives. X, which employs the higher-level graphics 
primitives, surprisingly sends the most data among all 
the 8-bit color thin-client systems. Although both VNC 
and Sun Ray use 2D draw primitives, the amount of 
data sent in each case is quite different. While the VNC 
display encoding appears the most data efficient, it 
includes built-in compression, so comparing its 
efficiency with the other systems without compression 
is not a fair comparison. On the other hand, Sun Ray 
uses 24-bit color, so comparing its efficiency with other 
8-bit systems is not entirely fair either. 

Figure 10: Average data transferred per page in the web 
benchmark with baseline settings at 100 Mbps. 

To account for the impact of different color depths 
on display encoding efficiency, we also measured the 
performance of X and VNC using 24-bit color, as these 
were the only platforms we used that could operate 
using either 8-bit or 24-bit color depth. As shown in 
Figure 10, both X and VNC send roughly three times as 
much data using 24-bit color as opposed to using 8-bit 
color. This suggests that to fairly compare Sun Ray 
with the other 8-bit color results, we should normalize 
the amount of data transferred by the pixel color depth, 
which would effectively reduce the amount of data Sun 
Ray transferred by a factor of three. The normalized 
Sun Ray data transfer measurements would then be 
better than X and only about 20 percent worse than 
ICA. Surprisingly, the use of simple 2D draw primitives 
results in data transfer requirements better than the 
high-level graphics X approach and not much different 
from the low-level graphics approach used by ICA, 
RDP, and AIP. Furthermore, Figure 9 shows that Sun 
Ray performs somewhat better than the 8-bit color ICA 
and RDP platforms despite providing a higher quality 
24-bit color display. 
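The color-depth normalization described above amounts to scaling a measurement by the ratio of pixel depths. A minimal sketch follows; the function name and the 120 KB input are hypothetical and are not measurements from this study.

```python
def normalize_to_depth(data_kb: float, source_bits: int, target_bits: int = 8) -> float:
    """Scale a data-transfer measurement to a different pixel color depth.

    Assumes transferred data grows linearly with bits per pixel, which is
    what the 8-bit versus 24-bit results for X and VNC suggest.
    """
    return data_kb * target_bits / source_bits


# Hypothetical 24-bit measurement, used only to show the factor-of-three scaling.
measured_24bit_kb = 120
print(normalize_to_depth(measured_24bit_kb, source_bits=24))
```

The linear assumption is only a first approximation; encodings whose cost does not depend on color depth, such as fills, would scale less than this.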

Figure 9 and Figure 10 also show the latency and 
data transfer measurements for the performance of the 
thin-client systems running the text-only version of the 
web benchmark. These results suggest that the higher- 
level display encodings are more optimized to reduce 
the data transfer requirements of text content as 
opposed to image content. Figure 10 shows that the 
higher-level encodings used by ICA, RDP, AIP, and X 
were much more bandwidth efficient for text than the 
lower-level encodings used by Sun Ray and VNC. In 
particular, RDP reduced the amount of data sent for text 
to less than five percent of that for both images and 
text. Despite the large bandwidth savings for text 
content, the higher-level encoding systems do not 
provide the same degree of reduction in latency, as 
shown in Figure 9. Instead, Sun Ray demonstrates the 
largest percentage reduction in web page download 
latency despite having the smallest percentage 
reduction in the amount of data transferred when 
comparing image and text content to text-only content. 
This again demonstrates that at a high enough 
bandwidth, the encoding overhead rather than the 
amount of data generated is the primary factor in 
determining the performance. 

Figure 11: Video quality in the video benchmark with 
baseline settings at 100 Mbps. 


3.2.2 Video Performance 


Figure 11 and Figure 12 show the video quality 
and data transfer measurements for the baseline 
performance of the thin-client systems. The video 
quality results shown in Figure 11 for the baseline 
display encoding configuration are quite similar to the 
results for the default configuration discussed in 
Section 3.1.2. All of the systems performed much better 
than RAW, which yielded poor video quality of less 
than 15 percent. X, AIP, and Sun Ray still deliver good 
video quality while ICA, RDP, and VNC deliver 
noticeably worse video quality. Although the video 
quality for ICA and RDP is similar to their respective 
performance with the default configurations, Figure 12 
shows that they send roughly twice as much data when 
just using the basic display encoding. 

To account for the impact of different color depths 
on display encoding efficiency, we again measured the 
performance of X and VNC using 24-bit color as well. 
As expected, both platforms send substantially more 
data using 24-bit color versus using 8-bit color. In 
addition, Figure 11 shows that when using 24-bit color, 
the video quality of VNC remains poor and the video 
quality of X decreases to about 65 percent. When 
comparing among the 24-bit color platforms, Sun Ray 
clearly delivers the best video quality. 

Figure 12: Total data transferred in full-motion (24 fps) 
and slow-motion (1 fps) playback in the video benchmark 
with baseline settings at 100 Mbps. 

An important lesson derived from the default and 
baseline video benchmark results is that the timing of 
display updates can be just as important as how a display 
update is encoded. X, AIP, and Sun Ray, which employ an 
eager server-push display update model, excelled in the 
video benchmark at 100 Mbps. AIP also uses a lazy 
model to adapt to lower bandwidths. When a rendering 
command is generated by the application, these thin- 
client systems immediately convert that command to 
the underlying display encoding primitives and send the 
display update to the client. The eager updates enable 
the server to keep up with the video application’s 
rendering commands and allow the server to take 
advantage of any semantic information that can be used 
from the rendering command. In contrast, ICA, RDP, 
and VNC employ a lazy display update model, in which 
multiple rendering commands are first buffered and 
then later merged before lazily sending the merged 
display updates to the client. For ICA and RDP, the 
updates are lazily sent at a server-defined rate. The 
problem is that the updates are not sent frequently 
enough for real-time video display, resulting in multiple 
video frames being merged and overwritten at the 
server and never displayed at the client. For VNC, the 
updates are lazily sent when the client requests them. 
Since the client running VNC is already heavily loaded, 
the client becomes a bottleneck in requesting the 
display updates, resulting in lost video frames that are 
merged and overwritten at the server before the client is 
able to generate the next display request. 
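The effect of update timing on frame loss can be illustrated with a toy model. The sketch below is a simplification and not the behavior of any particular system: it assumes a 24 fps video lasting 35 seconds and a hypothetical fixed update rate for the lazy case, and it counts how many distinct frames reach the client when buffered frames are overwritten between updates.

```python
from typing import Optional


def frames_delivered(video_fps: int, seconds: int, update_hz: Optional[int]) -> int:
    """Count distinct video frames that reach the client.

    update_hz=None models eager server-push: each frame is sent as soon as it
    is rendered. Otherwise the server buffers frames and, at each update tick,
    sends only the most recently rendered one; frames overwritten in between
    are never displayed.
    """
    total_frames = video_fps * seconds
    if update_hz is None:
        return total_frames
    delivered = set()
    for tick in range(1, update_hz * seconds + 1):
        # Index of the latest frame rendered by this tick (integer arithmetic).
        latest = tick * video_fps // update_hz
        if latest > 0:
            delivered.add(latest)
    return len(delivered)


eager = frames_delivered(24, 35, None)
lazy = frames_delivered(24, 35, 10)   # 10 updates/s is a hypothetical rate
print(f"eager: {eager} frames, lazy at 10 Hz: {lazy} frames ({lazy / eager:.0%})")
```

In this model a lazy sender only keeps up if its update rate is at least the video frame rate; below that, the fraction of frames displayed falls in proportion to the update rate.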


3.3 Caching and Compression 


Four of the six thin-client platforms tested employ 
some form of configurable caching or compression to 
improve system performance. ICA and RDP both 
employ run-length encoding compression and cache 
fonts and bitmaps in memory and on disk at the client. 
AIP also employs local client caching of display objects 
and uses an adaptive mechanism to progressively 
enable higher degrees of compression as the availability 
of network bandwidth becomes limited. VNC has RLE 
compression built-in with its display encoding format 
and employs a very simple form of on-screen caching 
whereby the client can simply copy display data from 
one portion of the screen to another rather than 
requesting it from the server if the display data is 
already displayed on another portion of the framebuffer. 
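Run-length encoding, which the text names as the compression used by ICA, RDP, and VNC, can be sketched generically as follows. The actual wire formats of these systems are not described here, so the (count, value) pair representation and the 255-pixel run limit are assumptions made for illustration.

```python
from typing import List, Tuple


def rle_encode(pixels: bytes) -> List[Tuple[int, int]]:
    """Encode a row of 8-bit pixels as (run_length, value) pairs, runs capped at 255."""
    runs: List[Tuple[int, int]] = []
    for value in pixels:
        if runs and runs[-1][1] == value and runs[-1][0] < 255:
            runs[-1] = (runs[-1][0] + 1, value)
        else:
            runs.append((1, value))
    return runs


def rle_decode(runs: List[Tuple[int, int]]) -> bytes:
    """Expand (run_length, value) pairs back into the original pixel row."""
    return b"".join(bytes([value]) * count for count, value in runs)


# A mostly uniform row, as in a text window with a solid background.
row = bytes([0] * 300 + [7] * 4 + [0] * 20)
encoded = rle_encode(row)
print(len(row), "pixels ->", len(encoded), "runs")
```

Rows with long uniform stretches collapse to a handful of runs, while photographic or video content with few repeated pixels gains little from this scheme.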

To examine the performance impact of caching 
and compression techniques, we measured the 
performance of the thin-client systems on the web and 
video benchmarks with various caching and 
compression configuration settings. We show results 
for ICA, RDP, AIP, and VNC. In Sections 3.3.1 and 
3.3.2, we compare four configurations: (1) the baseline 
results from Section 3.2 with all caching and 
compression options disabled, (2) only compression 
options enabled, (3) only caching options enabled, 
and (4) all caching and compression options enabled. In 
particular, for ICA and RDP, which support both 
memory and disk caching, we enabled or disabled both 
caches together. In Section 3.3.3, we explore the disk 
and memory caching options of ICA separately in 
further detail. For AIP, there was no option to disable 
caching as mentioned in Section 3.2, so the AIP 
cache-only and baseline cases are the same and there was 
no compression-only configuration tested. For VNC, 
the compression cannot be separately configured as it is 
part of the default hextile encoding used, so the VNC 
baseline and compression-only cases are the same and 
there was no cache-only configuration tested. 

Figure 13 through Figure 16 show the latency and 
data transfer measurements for running the web 
benchmark relative to the baseline performance of each 
system as reported in Section 3.2.1. We again show 
results for running the normal web benchmark with 
both images and text and the text-only version of the 
web benchmark. Figure 17 and Figure 18 show the 
video quality and slow-motion 1 fps data transfer 
measurements for running the video benchmark relative 
to the baseline performance of each system as reported 
in Section 3.2.2. 


3.3.1 Web Performance 


Figure 13 shows that using 100 Mbps bandwidth, 
there is no significant performance benefit due to 
caching and compression options in most of the thin- 
client systems. The most notable difference occurs for 
ICA with caching enabled. Surprisingly, enabling 
ICA’s cache increases the average web page latency by 
almost 40 percent over the baseline performance. 

Figure 14 shows that there was a substantial 
difference in the amount of data transferred for almost 
all platforms for different caching and compression 
options. For all three platforms for which compression 
could be enabled or disabled (ICA, RDP, and AIP), 
enabling compression resulted in a substantial reduction 
in the amount of data transferred, at least a factor of two 
in all cases. Note that the effect of AIP’s 
compression could not be isolated and directly 
compared with those of RDP and ICA, because its 
cache could not be disabled. Still, AIP shows a 
large reduction in data transfer when its compression is 
engaged, which is most likely due to its use of both 
RLE and LZW compression as opposed to using only 
RLE compression for ICA and RDP. AIP, however, 
was adversely affected by the added processing 
overhead of using cache and compression at 100 Mbps. 
When compression was enabled, the latency increased 
by 13%. At higher bandwidths, where the network is 
not the bottleneck, it may be advantageous to reduce the 
processing overhead by holding back on compression 
even if it results in a larger amount of data. Since 
performance at lower bandwidths is directly related to 
the amount of data transferred, compression is 
beneficial for improving performance at lower 
bandwidths. 

Caching is also not always beneficial. Among the 
systems that provided the option to enable or disable 
caching, Figure 14 shows that enabling caching results 
in the largest reduction in data transferred for ICA. With 
caching alone, ICA shows almost a factor of three 
reduction in data transfer, yet suffers a significant 
increase in the average web page latency. In other 
words, the overhead of ICA caching outweighs its 
benefits in high bandwidth network environments. On 
the other hand, using caching with RDP and VNC 
resulted in very little difference in either latency or data 
transferred versus not using caching. For VNC, the on- 
screen cache contains only the current display data, 
which does not provide sufficient history to be 
beneficial in reducing the amount of data that the server 
needs to send. However, the ineffectiveness of the 
cache for RDP is more surprising as its caching 
architecture is similar to ICA on the surface. Our results 
indicate that RDP’s caching mechanism may be 
either operating incorrectly or poorly designed. 
Figure 14 and Figure 16 show that there was no 
reduction in data size due to RDP’s disk cache. 

Figure 13: Latency (expressed as percentage relative to 
baseline) in the web benchmark at 100 Mbps with various 
cache and compression settings. 

Figure 15: Latency (expressed as percentage relative to 
baseline) in the web benchmark at 100 Mbps with various 
cache and compression settings and with only text content. 

Figure 15 and Figure 16 show the latency and data 
transfer measurements for various combinations of 
caching and compression for the thin-client systems 
running the text-only web benchmark. The results for 
running the text-only benchmark were generally similar 
to those for the normal web benchmark with both text 
and images. These results suggest that the caching and 
compression mechanisms have similar advantages and 
disadvantages for both the image and text content of the 
web benchmark. The one exception was for using 
caching with ICA. With the text-only content, 
performance did not degrade when ICA’s caching was 
engaged, unlike what we saw with both text and images. 

Figure 14: Data transferred (expressed as percentage relative 
to baseline) in the web benchmark at 100 Mbps with various 
cache and compression settings. 

Figure 16: Data transferred (expressed as percentage relative 
to baseline) in the web benchmark at 100 Mbps with various 
cache and compression settings and with only text content. 

Figure 17: Video quality (expressed as percentage relative to 
baseline) in the video benchmark at 100 Mbps with various 
cache and compression settings. 


3.3.2 Video Performance 


Figure 17 shows the video quality measurements 
for various combinations of caching and compression 
for the thin-client systems running the video benchmark 
at 100 Mbps. For RDP and VNC, there was little 
difference in the video quality for the various options. 
For ICA, the biggest difference again appeared with the 
use of caching, which resulted in a substantial decrease 
in video quality from roughly 50 percent to less than 5 
percent. For AIP, the use of compression reduced the 
VQ from over 90 percent to less than 30 percent. 

Figure 18 shows the 1 fps data transfer 
measurements for various combinations of caching and 
compression for the thin-client systems running the 
video benchmark. These measurements provide a 
quantitative comparison of the amount of data each 
system transferred when sending all of the video 
content to the client without discarding data. Just as for 
the web benchmark, for all three platforms, ICA, RDP, 
and AIP, for which compression could be enabled or 
disabled, enabling compression resulted in a substantial 
reduction in the amount of data. The data reduction was 
generally not as large for the video benchmark as for 
the web benchmark, reflecting the fact that the video 
content was not as compressible as the web content. 
More importantly, enabling compression can have a 
detrimental impact on video performance at LAN 
bandwidths, as in the case of AIP. Compression, 
however, could yield some benefit at lower bandwidths 
due to its ability to reduce the amount of data 
transferred. Unlike the other thin-client systems, AIP 
employs an adaptive mechanism for enabling 
compression that turns compression off at high 
bandwidths and on at low bandwidths. Our results 
suggest that an adaptive mechanism for enabling 
compression at lower bandwidths is useful in trading 
off compression overhead versus bandwidth savings at 
different bandwidths. 

Figure 18: Total data transferred (expressed as percentage 
relative to baseline) in slow-motion (1 fps) playback in the 
video benchmark at 100 Mbps with various cache and 
compression settings. 
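An adaptive policy of the kind described above can be sketched as a function that picks a compression level from the available bandwidth. This is an illustration only and not AIP's actual mechanism: the thresholds are hypothetical, and zlib stands in for the RLE and LZW compression that the text attributes to AIP.

```python
import zlib
from typing import Tuple


def choose_level(available_kbps: float) -> int:
    """Map available bandwidth to a zlib level (thresholds are hypothetical)."""
    if available_kbps >= 10_000:
        return 0   # LAN speeds: skip compression to avoid processing overhead
    if available_kbps >= 1_500:
        return 3   # broadband: light compression
    return 9       # dial-up or ISDN speeds: compress as much as possible


def encode_update(update: bytes, available_kbps: float) -> Tuple[int, bytes]:
    """Return (level, payload); level 0 means the update is sent uncompressed."""
    level = choose_level(available_kbps)
    if level == 0:
        return level, update
    return level, zlib.compress(update, level)


update = b"\x00" * 4096  # a uniform screen region, chosen for illustration
for kbps in (100_000, 1_500, 128):
    level, payload = encode_update(update, kbps)
    print(f"{kbps:>7} Kbps: level {level}, {len(payload)} bytes")
```

A real implementation would also need a way to estimate available bandwidth and to signal the chosen encoding to the client, neither of which is covered here.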

As in the case of the web benchmark, caching did 
not consistently reduce the amount of data transferred 
for the video benchmark. Among the systems that 
provided the option to enable or disable caching, Figure 
18 shows that enabling caching reduced the amount of 
data transferred for ICA, but had no impact on the 
amount of data transferred for RDP or VNC. Just as 
with the web benchmark, the video benchmark results 
indicate that the overhead of ICA caching outweighs its 
benefits in high bandwidth network environments. 


3.3.3 Memory versus Disk Caching 


Thin-client systems may implement a hierarchical 
caching architecture with multiple levels of cache. In 
ICA, two forms of client caching are applied to improve 
performance: caching in client memory and caching 
on client disk. These two forms of caching may have 
very different characteristics. Memory caching can 
provide much faster access times to smaller caches 
while disk caching can provide larger amounts of local 
cache with relatively slower access times. ICA provides 
both memory and disk caching as well as the ability to 
enable and disable each cache independently. We 
investigated the impact of memory and disk caching 
techniques by running the web and video benchmarks 
using ICA with various cache configurations. We 
considered all possible combinations of memory and 
disk caching, both with and without compression 
enabled. For the ICA disk cache, the maximum cache 
space and the minimum cacheable bitmap size are user- 
configurable. For our tests, the disk cache size was set 
to 39 MB, and the minimum cacheable bitmap size to 
8 KB. The memory cache size was 8 MB. These disk 
and memory cache settings were the defaults in the ICA 
client. 

Figure 19 through Figure 22 show the 
performance of ICA with the various cache and 
compression combinations available. As 
discussed in Section 2.3.1, the web benchmark cycles 
through 54 web pages twice. We call the first iteration 
Run 1 and the second Run 2. In order to highlight the 
effects of caching and compression, we present the 
performance relative to the baseline configuration as 
well as the performance ratio of Run 2 to Run 1. If 
enough elements are cached while displaying the 
content from Run 1, we would expect Run 2 to 
transfer less data from server to client and 
potentially yield better performance. Also, if some 
elements are displayed repeatedly within the 54-page 
iteration, then we would expect the amount of data 
transferred to decrease in Run 1 as well as Run 2. 

While there is no tool available to us to directly 
measure the cache hit/miss rate reliably for ICA, it 
is reasonable to assume that the ratio of data 
transferred from the server to the client with the cache 
turned on to that with the cache off provides a rough 
measure of the cache miss rate. As shown in Figure 20, 
Run 1 of the benchmark run at 100 Mbps with the disk 
cache on produced 77% of the data generated by Run 1 
with the baseline configuration. That is, the client was 
forced to fetch 77% of the total display data from the 
server even with the disk cache on, presumably because 
the data wasn’t found in the local cache. In Run 2, 
however, the data ratio drops to 48%. As expected, 
more data was found in the local cache in Run 2. 
Inferring from Figure 2, the first iteration of 54 pages 
would yield only 1.6 MB of data, which would fit well 
within the cache. However, not all of the elements were 
cached even though the 39 MB disk cache had enough 
capacity to store all objects encountered in Run 1. In 
particular, the bitmap objects smaller than 8 KB were 
not cacheable per the disk cache setting we used. 

Figure 19: The latency in Run 1 and Run 2 of the web 
benchmark at 100 Mbps with various cache and compression 
settings in ICA. The Run 1 and Run 2 latency are expressed 
as percentage relative to baseline as well as relative to one 
another. 

Figure 21: The latency in Run 1 and Run 2 of the web 
benchmark at 100 Mbps with various cache and compression 
settings in ICA and with only text content. The Run 1 and 
Run 2 latency are expressed as percentage relative to baseline 
as well as relative to one another. 

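The miss-rate proxy and the cache-capacity check described above can be written out as a short calculation. The sketch below uses the figures quoted in the text (54 pages, roughly 30 KB per page for ICA, a 39 MB disk cache, and data ratios of 77% and 48%) and assumes 1 MB = 1000 KB. The function and variable names are chosen here for illustration.

```python
PAGES_PER_RUN = 54      # pages in one iteration of the web benchmark
ICA_KB_PER_PAGE = 30    # approximate ICA average per page (Figure 2 discussion)
DISK_CACHE_MB = 39      # disk cache size used in the tests


def approx_miss_rate(data_cache_on: float, data_cache_off: float) -> float:
    """Rough miss-rate proxy: share of baseline data still fetched from the server."""
    return data_cache_on / data_cache_off


# Data produced by one run, assuming 1 MB = 1000 KB.
run_data_mb = PAGES_PER_RUN * ICA_KB_PER_PAGE / 1000

# Data ratios reported in the text for the disk-cache configuration (baseline = 100).
miss_run1 = approx_miss_rate(77, 100)
miss_run2 = approx_miss_rate(48, 100)

print(f"data per run: {run_data_mb:.2f} MB of a {DISK_CACHE_MB} MB cache")
print(f"approximate miss rate: Run 1 {miss_run1:.0%}, Run 2 {miss_run2:.0%}")
```

This proxy treats every byte saved as a cache hit, so it is only as accurate as the assumption that caching is the sole reason the data volume changed.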
Comparing the relative data size and latency 
between Run | and Run 2, it 1s evident that the memory 
cache serves to handle small elements, while the disk 
cache is used for caching large bitmaps. Figure 19 
shows that there is less significant improvement in 
latency in Run 2 compared to Run I with memory 
cache engaged. Figure 20 shows that there is almost no 


[Chart: bars for Run 1 vs. baseline Run 1, Run 2 vs. baseline Run 2, and Run 2 relative to Run 1, plotted as percentages across the test configurations.]
Figure 20: The data size in Run 1 and Run 2 of the web 
benchmark at 100 Mbps with various cache and compression 
settings in ICA. The Run 1 and Run 2 data sizes are expressed
as percentage relative to baseline as well as relative to one
another.
Figure 22: The data size in Run 1 and Run 2 of the web
benchmark at 100 Mbps with various cache and compression
settings in ICA and with only text content. The Run 1 and
Run 2 data sizes are expressed as percentage relative to 
baseline as well as relative to one another. 
difference in data size between Run 1 and Run 2. While
small graphical elements appear repeatedly throughout
each cycle of 54 web pages, the large bitmaps seen in
Run 1 only reappear when the same web page reappears
in Run 2. If the memory cache cached larger objects,
then we would expect to see a significant change in Run
2 compared to Run 1. With disk caching, however, we
do observe such a change. The difference in the types 
of objects cached caused the two methods of cache to 
yield very different performance characteristics in our 
tests. 

A notable finding was that, at 100 Mbps, ICA
performed worse whenever the disk cache was engaged 
even though the cache significantly reduced the amount 
of transferred data. As shown in the web benchmark 
results in Figure 19, the increase in latency with disk 
caching, relative to the baseline setting, was almost by a 
factor of two in Run 1. In Run 2, there was a slight 
improvement in performance with the disk cache 
engaged, but when accounting for both Run 1 and Run
2, there was 44% higher latency overall. These data
suggest there is a heavy cache-miss penalty associated
with ICA’s disk caching. At a high bandwidth like 100 
Mbps, the amount of time required to look up the cache 
becomes significant relative to the network access time. 

Figure 21 shows that the performance degradation 
due to disk caching does not occur in displaying text- 
only content, except when disk caching is used in
combination with compression. The disk cache is
primarily utilized for storing large bitmap objects. In 
the text-only test, no bitmap image is displayed during 
the benchmark run; therefore, we would expect the disk 
cache to have little to no effect. As seen in Figure 22, 
the disk cache does not contribute to any decrease in 
data size. We note that, in general, the data size 
increases slightly in Run 2 of the text-only tests
compared to Run 1, because Netscape for Windows,
with its own cache engaged, behaves slightly differently
in Run 2 compared to Run 1 in terms of the way the
page is drawn.

Memory caching, on the other hand, introduced no 
performance degradation. As shown in Figure 19, in 
both Run 1 and Run 2, the latency with memory cache
engaged was less than the baseline latency. Figure 20
shows the transferred data size was reduced to roughly
half relative to baseline. Although each of the 54 web
pages is displayed for the first time in Run 1, there are
fonts, text, and small graphical elements (like the PC 
Magazine logo) that are repeated many times. With disk 
caching, any benefit in caching the repeated graphical 
elements was overwhelmed by the penalty in looking 
up the cache on the hard drive. The cache-miss penalty 
associated with the memory cache is much less severe.
Figure 23: Average latency per page in the web benchmark
for ICA with disk cache on and off.


The lower thin-client performance with disk
caching is due to the relative speed of the network
compared with the disk and the penalty associated with
cache misses. In a 100 Mbps LAN environment, the
network is almost comparable in speed and
bandwidth to the sustained performance of the local
disk of our client machine. Consequently, obtaining
display data from the disk cache is not necessarily 
faster than obtaining the data from the server across the 
100 Mbps network. In addition, with disk caching 
enabled, each disk cache miss requires the client to 
access the local disk as well as obtain the display data 
across the network. If local disk and network speeds 
were comparable, a cache miss would result in roughly 
twice as much latency as when the data were simply 
sent from the server without any disk caching. 

Figure 23 compares the performance of ICA at 
various network bandwidths with the default 
configuration settings versus the same settings except 
with the disk cache enabled. The results show that 
while disk caching adversely affects ICA performance 
at higher network bandwidths, it improves ICA 
performance at bandwidths below 768 Kbps. At low 
enough network bandwidths, the disk access time 
becomes insignificant relative to the network access 
time such that it is much faster to fetch data from the
client disk cache than to go across the network to the
server. For lower bandwidth networks, assuming
reasonable cache hit rates, the benefit of smaller disk
cache latencies on cache hits outweighs the penalty of
extra disk cache latencies incurred on cache misses.
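
The tradeoff described above can be illustrated with a small back-of-the-envelope model. This is only a sketch under stated assumptions, not the measurement method used in this study: every lookup is assumed to touch the disk, a hit is served entirely from disk, a miss pays the disk access and then the full network transfer, and disk access is modeled as pure transfer time with no seek or lookup overhead. The object size, the disk rate, and the 50% hit rate are hypothetical values chosen for illustration.

```python
def fetch_latency(size_bytes, bandwidth_bps):
    """Seconds needed to move size_bytes at the given bit rate."""
    return size_bytes * 8 / bandwidth_bps


def expected_latency(size_bytes, net_bps, disk_bps, hit_rate, disk_cache=True):
    """Expected per-object latency with or without a client disk cache.

    Assumed model: a hit costs one disk access; a miss costs one disk
    access plus the network transfer from the server.
    """
    net = fetch_latency(size_bytes, net_bps)
    if not disk_cache:
        return net
    disk = fetch_latency(size_bytes, disk_bps)
    return hit_rate * disk + (1 - hit_rate) * (disk + net)


# Hypothetical parameters: 64 KB object, disk comparable to a 100 Mbps link.
OBJECT_BYTES = 64 * 1024
DISK_BPS = 100e6
HIT_RATE = 0.5

for net_bps in (100e6, 768e3):
    off = expected_latency(OBJECT_BYTES, net_bps, DISK_BPS, HIT_RATE, disk_cache=False)
    on = expected_latency(OBJECT_BYTES, net_bps, DISK_BPS, HIT_RATE)
    print(f"{net_bps / 1e6:g} Mbps: cache off {off:.4f} s, cache on {on:.4f} s")
```

In this simplified model the cache pays off only when the network rate is below the hit rate times the disk rate, and a miss costs twice the uncached latency when disk and network speeds are equal. Because seek and lookup costs are ignored, the model is not expected to reproduce the crossover point measured in Figure 23.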


4. Related Work 


Several studies have been conducted to evaluate 
thin-client computing architectures. Danskin conducted
an early study of the X protocol [7] by gathering traces 
of X requests. Citrix and Microsoft have conducted 
internal performance testing of their products. 
Microsoft has examined thin-client scalability issues in 
Terminal Services performance for the purposes of 
capacity planning [15]. Schmidt, Lam, and Northcutt
examined the performance of the Sun Ray platform in
comparison to the X protocol [26]. Wong and Seltzer 
have studied the performance of Windows NT Terminal 
Server and LBX [34, 35]. Tolly Research has conducted 
similar studies for Citrix MetaFrame [31]. Howard has 
measured the performance of various hardware thin 
clients using the i-Bench benchmark suite [9], but his
results suffer from methodology problems due to only 
measuring server-side application performance instead 
of user-perceived client-side performance. We have 
also conducted earlier studies of thin-client
performance [18, 19, 36, 37], including previously 
developing the slow-motion benchmarking [37] used in 
this paper. Some of these studies have examined 
selected thin-client systems in detail via internal 
instrumentation. However, no study considered the 
performance of remote display mechanisms across the 
broad range of systems, system configurations, and 
network bandwidths discussed here. We have also 
further considered the performance of thin-client 
systems in wide-area network environments [12]. 

In addition to the thin-client systems discussed in 
this paper, a number of other systems for remote 
display have been developed. These include extensions 
to the systems considered such as low-bandwidth X 
(LBX) [1] and Kaplinsky's recent VNC tight encoding
[11] as well as remote access solutions such as Laplink 
[13] and PC Anywhere [20]. Because of space 
constraints and previous work [18, 19] showing that 
LBX, Laplink, and PC Anywhere perform very poorly, 
we did not include them in this study. While thin-client 
systems have primarily been employed in LAN 
environments, a growing number of ASPs are
employing thin-client technology to host desktop 
computing sessions that are remotely delivered over 
WAN environments. Examples include services from 
Charon Systems [3], Runaware [23], and Expertcity [8]. 


5. Conclusions and Future Work


Our results show that thin-client systems can 
provide good performance for web and multimedia 
applications in LAN environments. Unlike traditional 
PC software environments, our results show that 
different thin-client system designs exhibit widely 
varying performance that can differ by orders of 
magnitude in some cases. Through our experiments, we
have analyzed various design choices underlying
current thin-client systems. Specifically, our
measurements show three important conclusions 
regarding thin-client system design. 

First, higher-level graphics display primitives are 
not always more bandwidth efficient than lower-level 
display encoding primitives. X, which uses high-level
graphics encoding, consumed the most bandwidth in
rendering the display at 8-bit color. Furthermore, 
higher-level primitives are often more optimized for 
text-oriented content, which will likely become a 
smaller and smaller percentage of display content as 
multimedia applications become increasingly popular. 

Second, the timing in sending display updates 
from the server to the client can be as important as how 
display updates are encoded. Our results indicate that 
an eager server-push model as used in X and Sun Ray 
provides better overall performance than lazy update 
models like ICA, RDP, and VNC, especially for 
multimedia video applications. While lazy update 
models may lead to some bandwidth savings by 
discarding or merging display updates, our results show 
that these techniques for optimizing bandwidth 
efficiency degrade the performance of multimedia 
applications even in high bandwidth environments. 

Third, display caching and compression are 
techniques which should be used with care as they can 
help or hurt thin-client performance. At higher 
bandwidths, ICA displayed significant performance 
degradation when caching was engaged, and AIP 
slowed down when its compression was forced on. Our 
results with current thin-client systems suggest that 
existing compression techniques provide a greater 
performance benefit than current caching mechanisms. 
Furthermore, adaptive use of these mechanisms based 
on the availability of network bandwidth as shown by 
AIP produces a good balance between the
computational overhead of these encoding mechanisms 
and the potential bandwidth savings that they provide. 
In general, cutting down the processing time is
desirable when there is enough network bandwidth, 
while reducing the amount of transferred data is 
beneficial at lower network speeds. 

Our results quantify the effectiveness of a number 
of thin-client design and implementation choices across 
a broad range of thin-client platforms and network 
environments. In doing so, we provide the first
comparative analysis of the performance of these 
systems. These measurements provide a basis for future 
research in developing more effective thin-client 
systems. 
Abstract


In this paper, we identify an emerging and impor-
tant application class comprised of a set of pro-
cesses on a cluster of devices communicating to a
remote set of processes on another cluster of de-
vices across a common intermediary Internet path.
We call these applications cluster-to-cluster applica-
tions, or C-to-C applications. The networking re-
quirements of C-to-C applications present unique 
challenges. Because the application involves com- 
munication between clusters of devices, very few 
streams will share a complete end-to-end path. At 
the same time, network performance needs to be 
measured globally across all streams for the appli- 
cation to employ interstream adaptation strategies. 
These strategies are important for the application 
to achieve its global objectives while at the same 
time realizing an aggregate flow behavior that is
congestion controlled and responsive. We propose
a mechanism called the Coordination Protocol (CP)
to provide this ability. In particular, CP makes fine-
grained measurements of current network conditions
across all associated flows and provides transport-
level protocols with aggregate available bandwidth
information using an equation-based congestion con-
trol algorithm. A prototype of CP is evaluated
within a network simulator and is shown to be ef- 
fective. 


1 Introduction 


Advances in broadband networking, the emer- 
gence of information appliances (e.g., TiVo, PDAs,
HDTV, etc.), and the now ubiquitous computer pro-
vide an environment rife with possibilities for new
sophisticated multimedia applications that truly
incorporate multiple media streams and interac-
tivity. We believe many of these future In-
ternet applications will increasingly make use of
multiple communication and computing devices in 


a distributed fashion. Examples of these ap- 
plications include distributed sensor arrays, tele-
immersion [13], computer-supported collaborative
workspaces (CSCW) [7], ubiquitous computing envi-
ronments [16], and complex multi-stream, multime-
dia presentations [17]. In these applications, no one
device or computer produces or manages all of the 
data streams transmitted. Instead, the endpoints of 
communication are collections of devices. We call 
applications of this type cluster-to-cluster applica- 
tions, or C-to-C applications. 

C-to-C applications share three important prop- 
erties: 


• They generate many independent, but semanti-
cally related, flows of data.


• While very few flows within the application will
share the exact same end-to-end path, all flows
will share a common intermediary path between
clusters.


• This shared common path is the primary con-
tributor of transmission delay and the source
of dynamic network conditions including loss,
congestion, and jitter.


Traditional multimedia applications like stream- 
ing video generate only a few media streams (e.g.,
audio and video) which in general originate and ter-
minate at the same devices (e.g., media server to
media client). The applications we envision go far
beyond this traditional model and include myriad
flows of information of many different types commu-
nicated between clusters of devices.

Each flow of information may play a different role 
within the application and thus should be matched 
with a specific transport-level protocol which pro- 
vides the appropriate end-to-end networking behav-
ior. Furthermore, these flows will have complex se-
mantic relationships which must be exploited by the 
application to appropriately adapt to changing net- 
work conditions and respond to user interaction. 
The fundamental problem with current
transport-level protocols within the C-to-C
application context is their lack of coordina-
tion.


Application streams share a common intermediary 
path between clusters, and yet operate in isolation 
from one another. As a result, flows may compete
with one another when network resources become 
limited, instead of cooperating to use available band- 
width in application-controlled ways. 

In this paper, we describe and evaluate a mech- 
anism that allows transport-level protocol coordi- 
nation of separate, but semantically related, flows 
of data. Our approach is to introduce mechanisms 
at the first- and last-hop routers which make mea- 
surements of current network conditions integrated 
across all flows associated with a particular C-to- 
C application. These measurements are then com- 
municated to the transport-level protocols on each 
endpoint. This enables a coordinated response to 
congestion across all flows that reflects application- 
level goals and priorities. We leverage recent work in
equation-based congestion control to ensure that the 
aggregate bandwidth used by all of the flows is TCP- 
friendly while allowing the application to allocate 
available bandwidth to individual flows in whatever 
manner suits its purposes.

The main contributions of this paper are: 


• Identification of the C-to-C class of Internet ap-
plications, including a brief motivating exam-
ple.


• Description of the networking challenges unique
to this application type.


• A proposal for a mechanism that provides
transport-level protocol coordination in C-to-C
applications.


• Evaluation of several aspects of our mechanism
using simulation.


The rest of this paper is organized as follows: In 
Section 2, we present the C-to-C application model, 
describe a motivating example, and discuss network- 
ing requirements unique to this class of distributed 
applications. In Section 3, we review related work. 
We present our solution to the transport-level proto-
col coordination problem in Section 4, and provide
some experimental evaluation in Section 5. Section 6
mentions future work, and Section 7 briefly summa- 
rizes the contents of this paper. 
Figure 1: C-to-C application model.


2 Motivation 


In this section, we describe in more detail the C-to- 
C application model, and illustrate it with a specific 
example. We then discuss the networking challenges 
associated with this application type, and why there 
is a need for a protocol coordination mechanism. 


2.1 C-to-C Application Model 


We model a generic C-to-C application as two sets of 
processes executing on two sets of communication or 
computing devices. Figure 1 illustrates this model. 

A cluster is comprised of a set of endpoints dis- 
tributed over a set of endpoint hosts (computers 
or communication devices) and a single aggregation 
point, or AP. Each endpoint is a process that sends 
and/or receives data from another endpoint belong- 
ing to a remote cluster. The AP functions as a gate-
way node traversed by all cluster-to-cluster flows.
The common traversal path between aggregation 
points is known as the C-to-C data path. 


The AP is typically the first-hop router connect- 
ing the cluster to the Internet and the cluster end- 
points are typically on the same local area network. 
This configuration, however, is not strictly required 
by our model or our proposed mechanism. Our 
model is intended to capture several important char- 
acteristics of C-to-C applications. First, network- 
ing resources among endpoints of the same cluster 
are generally well provisioned for the needs of the 
application. Second, latency between endpoints of 
the same cluster is small compared to latency be-
tween endpoints on different clusters. Third, there
exists a natural point within the network topology 
through which all cluster-to-cluster communication 
flows which can act as the AP. Finally, the C-to-C 
data path between AP’s is the main source of dy- 
namic network conditions such as jitter, congestion, 
and delay. Our overall objective is to coordinate 
endpoint flows across the C-to-C data path. 
Figure 2: The Office of the Future


2.2 An Example Application 


A concrete example of a C-to-C application may help
clarify the types of applications we envision. In the 
Office of the Future, conceived by Fuchs et al. [13], 
tens of digital light projectors are used to make al- 
most every surface of an office (walls, desktops, etc.) 
a display surface. Similarly, tens of video cameras 
are used to capture the office environment from a 
number of different angles. At real-time rates, the 
video streams are used as input to stereo correla-
tion algorithms to extract 3D geometry information.
Audio is also captured from a set of microphones. 
The video streams, geometry information, and au- 
dio streams are all transmitted to a remote Office 
of the Future environment. At the remote environ- 
ment, the video and audio streams are warped us- 
ing both local and remote geometry information and 
stereo views are mapped to the light projectors. Au- 
dio is spatialized and sent to a set of speakers. Users 
within each Office of the Future environment wear 
shutter glasses that are coordinated with the light 
projectors.

The result is an immersive 3D experience in which 
the walls of one office environment essentially disap- 
pear to reveal the remote environment and provide 
a tele-immersive collaborative space for the partic- 
ipants. Furthermore, synthetic 3D models may be 
rendered and incorporated into both display envi- 
ronments as part of the shared, collaborative experi- 
ence. Figure 2 is an artistic illustration of the appli- 
cation. A prototype of the application is described
in [13].

The Office of the Future is a good example of 
a C-to-C application because the endpoints of the 
application are collections of devices. Two simi-
larly equipped offices must exchange myriad data 


streams. While few streams (if any) will share a 
complete end-to-end communication path, all of the 
data streams will span a common shared path be- 
tween the local networking environments of each Of- 
fice of the Future. 

The local network environments are not likely to 
be the source of congestion, loss, or other dynamic 
network conditions because they can be provisioned 
to support the Office of the Future application. The 
shared Internet path between two Office of the Fu- 
ture environments, however, is not under local con- 
trol and thus will be the source of dynamic network 
conditions. 

The Office of the Future has a number of com- 
plex application-level adaptation strategies that we
believe are typical of C-to-C applications. One such 
strategy, for example, is dynamic interstream prior- 
itization. Since media types are integrated into a 
single immersive display environment, user interac- 
tion with any given media type may have implica- 
tions for how other media types are encoded, trans- 
mitted, and displayed. The orientation and posi- 
tion of the user’s head, for example, indicates a re- 
gion of interest within the office environment. Me- 
dia streams that are displayed within that region 
of interest should receive a larger share of available 
bandwidth and be displayed at higher resolutions 
and frame rates than media streams that are out- 
side the region of interest. When congestion occurs, 
lower priority streams should react more strongly 
than higher priority streams. In this way, appro-
priate aggregate behavior is achieved and dynamic, 
application-level tradeoffs are exploited. 


2.3 Networking Requirements of C- 
to-C Applications 


A useful metaphor for visualizing the networking 
requirements of C-to-C applications is to view the 
communication between clusters as a rope with
frayed ends. The rope represents the aggregate data
flow between clusters. Each strand represents one
particular flow between endpoints. At the ends of 
the rope, each frayed strand represents a separate 
path between an endpoint and its local AP. The 
strands come together at the AP’s to form a single 
aggregate object. While each strand is a separate 
entity, they share a common fate and purpose when 
braided together. 

With this metaphor in mind, we identify several 
important networking requirements of C-to-C appli- 
cations: 


• Preserved end-to-end semantics.
The transport-level protocol (i.e., TCP, UDP,
RTP, RAP, etc.) that is used by each flow is
specific to the communication requirements of
the data within the flow and the role it plays
within the application. Thus, each transport-
level protocol should maintain the appropriate
end-to-end semantics and mechanisms. For ex-
ample, if a data flow contains control infor-
mation that requires in-order, reliable deliv-
ery, then the transport-level protocol used (e.g.,
TCP) should provide these services on an end-
to-end basis.


• Global coordinated measurements of
throughput, delay, and loss.

The application is interested in overall perfor-
mance which may involve complex interstream
adaptation strategies in the face of changing
network conditions. Throughput, delay, and
loss should be measured across all flows associ-
ated with the application as an aggregate. Fur-
thermore, the behavior of individual transport- 
level protocols must reflect both the end-to-end 
semantics associated with the protocol as well 
as application-level adaptation strategies. To 
achieve this, we need to separate the adaptive 
dynamic behavior of each transport-level proto- 
col from the mechanisms used to measure cur- 
rent network conditions. 


• TCP-friendliness.

While the C-to-C application is free to pri-
oritize how bandwidth is allocated among its
streams, the total bandwidth used needs to be
responsive to congestion. The emerging gold
standard for evaluating responsiveness is TCP-
friendliness. Intuitively, a flow of data is consid-
ered TCP-friendly if it consumes as much band-
width as a competing TCP flow consumes given
the same network conditions. The advantage of
using TCP-friendliness as a standard by which
to measure the congestion response of a protocol
(or in our case, the aggregate behavior of a set
of protocols) is that it ensures “fairness” with
the large majority of Internet traffic (including
HTTP) that uses TCP as an underlying data
transport protocol.


• Information about peer flows.

Individual streams within the C-to-C applica- 
tion may require knowledge about other streams 
of the same application. This knowledge can 
be used to determine the appropriate adap- 
tive behavior given application-level knowledge 
about interstream relationships. For example, 
an application may want to establish a relation- 
ship between two flows of data such that one 
flow consumes twice as much bandwidth as the
other. 


• Flexibility for the application.

A C-to-C application should be free to exploit 
trade-offs without constraint. That is, a coordi- 
nation mechanism should not preclude dynamic 
changes in bandwidth usage among flows, or 
enforce any particular scheme for establishing 
bandwidth usage relationships between flows. 
The application should be free to implement 
whatever adaptation policy is most appropriate 
in whatever manner is most appropriate.
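
To make the peer-flow and flexibility requirements concrete, the following sketch shows one way an application could divide an aggregate, congestion-responsive rate among its flows according to weights it chooses itself. It is an illustration only: the function name, the flow names, and the weight-proportional policy are hypothetical and are not prescribed by the requirements above, which deliberately leave the policy to the application.

```python
def allocate_bandwidth(aggregate_bps, weights):
    """Split an aggregate rate among flows in proportion to their weights.

    aggregate_bps: total rate the application may use (assumed to come
                   from some TCP-friendly estimate of the shared path).
    weights:       mapping of flow name to a non-negative priority weight.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one flow needs a positive weight")
    return {flow: aggregate_bps * w / total for flow, w in weights.items()}


# One flow is given twice the bandwidth of the other, as in the example above.
shares = allocate_bandwidth(3_000_000, {"video_in_view": 2, "video_out_of_view": 1})
print(shares)
```

Because the weights are ordinary application data, they can be changed at any time, for example when the user's region of interest moves, without altering how the aggregate rate itself is determined.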


3 Related Work 


3.1 Application-level Framing 


The ideas of this paper are firmly grounded in the 
concept of Application Level Framing (ALF) [5]. 
The ALF principle states that networking mecha- 
nisms should be coordinated with application-level 
objectives. As explained above, however, C-to- 
C applications present unique challenges because 
these objectives involve interstream tradeoffs for 
flows that do not share a complete end-to-end 
path. The actions of heterogeneous protocols dis- 
tributed among a cluster of devices must be coordi- 
nated to incorporate application-specific knowledge. 
In essence, we are extending the ALF concept to
the idea of adapting protocol behavior to reflect
application-level semantics. This idea is also well 
expressed in a position paper by Padmanabhan [11]. 


3.2 Protocol Coordination 


The coordination problem presented by C-to-C ap- 
plications is addressed most directly by Balakrish-
nan et al. in their work on the Congestion Manager
(CM) [3, 1, 2]. CM provides a framework for differ- 
ent transport-level protocols to share information on 
network conditions, specifically congestion, thus al- 
lowing substantial performance improvements. We 
note, however, that CM flows share the same end-to-
end path, while C-to-C flows share only a common 
intermediary path. The fact that C-to-C senders 
do not reside on the same host significantly lim- 
its the extensibility of the CM architecture to our 
problem context. CM offers applications sharing the 
same macroflow a system API and callback mecha- 
nisms for coordinating send events. Implementing 
this scheme using message passing between hosts is,
at best, problematic.
Furthermore, CM makes use of a scheduler to ap-
portion bandwidth among flows. In [3], this is imple-
mented using a Hierarchical Round Robin (HRR) al-
gorithm. We might extend this scheme to the C-to-C
context by placing the scheduler at the AP. Doing so, 
however, results in several problems. First, packet 
buffering mechanisms are required which, along with 
scheduling, add complexity to the AP and hurt for- 
warding performance. Second, packet buffering at
the AP lessens endpoint control over send events 
since endpoint packets can be queued for an indeter- 
minate amount of time. Balakrishnan et al. deliber- 
ately avoid buffering for exactly this reason, choos- 
ing instead to implement a scheduled callback event. 
Finally, scheduler configuration is problematic since 
C-to-C applications are complex and may continu- 
ally change the manner in which aggregate band-
width is apportioned among flow endpoints.


In [9], Kung and Wang propose a scheme for ag- 
gregating traffic between two points within a back-
bone network, and applying the TCP congestion
control algorithm to the whole bundle. The mech-
anism is transparent to applications and does not
provide a way for a particular application to make 
interstream tradeoffs. 

Pradhan et al. propose a way of aggregating TCP 
connections sharing the same traversal path in order
to share congestion control information [12]. Their
scheme takes a TCP connection and divides it into
two separate (“implicit”) TCP connections: a “local
subconnection” and a “remote subconnection.” This
scheme, however, breaks the end-to-end semantics of 
the transport protocol. 

[14] describes a scheme for sharing congestion in-
formation across TCP flows from different hosts.
This work is similar to ours in that a mechanism 
is introduced within the network itself to coordi- 
nate congestion response across a number of differ- 
ent flows which may not share a complete end-to- 
end path. Their mechanism does not provide the 
application with information about flows as an ag-
gregate, however, and focuses on optimizing TCP
performance by avoiding slow-start, and detecting 
congestion as early as possible. 

Finally, Seshan et al. propose the use of perfor- 
mance servers that act as a repository for end-to- 
end performance information [15]. This informa- 
tion may be reported by individual clients or col-
lected by packet capture hosts, and then made avail-
able to client applications using a query mechanism. 
The time granularity of performance information is 
coarse compared to CP, however, since it is intended 
to enable smart application decisions on connection 
type and destination, and not ongoing congestion 


[Diagram: Endpoint, Aggregation Point, Aggregation Point, Endpoint along the packet path; stack layers shown: Application Layer, Transport Layer, Coordination Layer, Network Layer.]

Figure 3: CP network architecture. 


responsiveness. In addition, their work does not as- 
sociate heterogeneous flows belonging to the same 
application, or consider the performance of flow ag- 
gregates. 


3.3 Equation-based Congestion Con-
trol


TCP-friendly equation-based congestion control has 
recently matured as a technique for emulating TCP 
behavior without replicating TCP mechanics. In [6, 
10], an analytical model for TCP behavior is derived 
that can be used to estimate the appropriate TCP- 
friendly rate given estimates of various channel prop-
erties. A number of important recommendations for
using their TCP-friendly equation-based congestion
control have been documented in [8].
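
As a rough illustration of how such an equation is used, the sketch below computes a TCP-friendly rate from a packet size, a round-trip time, and a loss event rate. It uses the TCP throughput equation in the form given by the TFRC specification (RFC 5348); that this is the same form as the model derived in [6, 10] is an assumption, as are the function name and the default of setting the retransmission timeout to four round-trip times.

```python
from math import sqrt


def tcp_friendly_rate(s, rtt, p, t_rto=None, b=1):
    """Approximate TCP-friendly sending rate in bytes per second.

    s:     packet size in bytes
    rtt:   round-trip time in seconds
    p:     loss event rate, 0 < p <= 1
    t_rto: retransmission timeout in seconds (assumed 4 * rtt if omitted)
    b:     packets acknowledged by a single ACK
    """
    if not 0 < p <= 1:
        raise ValueError("loss event rate must be in (0, 1]")
    if t_rto is None:
        t_rto = 4 * rtt
    denominator = (rtt * sqrt(2 * b * p / 3)
                   + t_rto * (3 * sqrt(3 * b * p / 8)) * p * (1 + 32 * p ** 2))
    return s / denominator


# Hypothetical channel estimates: 1460-byte packets, 100 ms RTT.
for loss in (0.001, 0.01, 0.05):
    print(f"p={loss}: {tcp_friendly_rate(1460, 0.1, loss):.0f} bytes/s")
```

The rate falls as either the loss event rate or the round-trip time grows, which is the behavior a sender needs in order to remain responsive to congestion without replicating TCP's window mechanics.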


4 Coordination Protocol (CP) 


In this section we describe our solution to the prob- 
lem of transport-level protocol coordination in C-to- 
C applications. 


4.1 The Coordination Protocol (CP) 


We propose the use of a new protocol which operates between the
network layer (IP) and transport layer (TCP, UDP, etc.) that
addresses the need for transport-level coordination. We call this
protocol the Coordination Protocol (CP). The coordination function
provided by CP is transport protocol independent. At the same time,
CP is distinct from network-layer protocols like IP that play a more
fundamental role in routing a packet to its destination.

CP works by attaching probe information to packets transmitted from
one cluster to another. As additional probe information is returned
along the reverse cluster-to-cluster data path, a picture of current
network conditions is formed by the AP and shared among endpoints
within the local cluster. A consistent view of network conditions
across flows follows from the fact that the same information is
shared among all endpoints.

Figure 3 shows our proposed network architecture from a stack
implementation point of view. CP exists on each endpoint device
participating in the C-to-C application, as well as on the two
aggregation points (APs) on either end of the cluster-to-cluster
data path. Routers on the data path between APs need not be
CP-enabled since they examine only the IP header of each incoming
packet in order to route the packet in their customary manner.

The decision to insert CP between the network and transport layer
rather than handling coordination at the application level requires
some justification. Of primary importance to us is the preservation
of end-to-end semantics. An alternative would be for each endpoint
to send to a multiplexing agent who would send the data, along with
probe information, to a demultiplexing agent on the remote cluster.
By breaking the communication path into three stages, however, the
end-to-end semantics of individual transport-level protocols have
been severed. Such a scheme would also mandate that
application-level control is centralized and integrated into the
multiplexing agent.


Furthermore, we note that CP logically belongs 
between the network and transport layer. While the 
network layer handles the next-hop forwarding of 
individual packets and the transport layer handles 
the end-to-end semantics of individual streams, CP 
is concerned with streams that share a significant 
number of hops along the forwarding path but do 
not share the same end-to-end path. This relaxed 
notion of a stream bundle logically falls between the 
strict end-to-end notion of the transport-level and 
the independent packet notion of the network-level. 


Finally, placement of CP between the network and 
transport layer allows for greater efficiency. In an 
application-level implementation of CP, information 
on network conditions (e.g., round trip time between 
APs) must pass up through an endpoint’s protocol 
stack to the application layer. The information must 
then be passed back down to the transport layer 
where sending rate adjustments can be made in re- 
sponse to the information. In contrast, a distinct 
coordination layer allows for the information to be 
received and passed directly to the transport layer 
in a single pass as the incoming packet is processed 
by each layer of its endpoint’s network stack. 

While we acknowledge that implementing CP mechanisms at the
application layer is indeed possible, we believe there are distinct
advantages to the approach we have chosen. We emphasize, however,
that the relative merits or drawbacks of our scheme are merely
implementation issues that should not obscure the fundamental
problem of C-to-C flow coordination described in this paper.


4.2 CP Packet Headers 


Figure 4 shows a CP data packet. CP encapsulates transport-level
packets by prepending a 16-byte header and indicating in the
protocol field which transport-level protocol is associated with the
packet. In turn, IP encapsulates CP packets and indicates in its
protocol field that CP is being used.

Each CP header contains an application identi- 
fier associating the packet with a particular C-to-C 
application, and a flow identifier indicating which 
flow from a given endpoint host the packet belongs 
to. The triple (application id, IP address, flow id) 
uniquely identifies each flow within the C-to-C ap- 
plication, and hence the source of each CP packet. 
The header also contains a version number and a 
flags field. 
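
As a rough illustration of the identification fields described
above, the following Python sketch packs and unpacks a 16-byte
header and derives the identifying triple. The field widths, their
order, and the names `CPHeader` and `flow_key` are assumptions made
for this example only; the text fixes the total size and the kinds
of fields, not their layout.

```python
import struct
from dataclasses import dataclass

# Hypothetical layout (not taken from the paper): version, flags, protocol,
# one pad byte, 16-bit application id, 16-bit flow id, and 8 bytes whose
# meaning changes along the path. Network byte order, 16 bytes in total.
_FORMAT = "!BBBxHH8s"
HEADER_SIZE = struct.calcsize(_FORMAT)


@dataclass(frozen=True)
class CPHeader:
    version: int
    flags: int
    protocol: int                  # transport protocol carried inside CP
    app_id: int                    # identifies the C-to-C application
    flow_id: int                   # identifies the flow on the sending host
    variable: bytes = b"\x00" * 8  # role-dependent contents

    def pack(self) -> bytes:
        """Serialize the header into its 16-byte wire form."""
        return struct.pack(_FORMAT, self.version, self.flags, self.protocol,
                           self.app_id, self.flow_id, self.variable)

    @classmethod
    def unpack(cls, data: bytes) -> "CPHeader":
        """Parse the first 16 bytes of a CP packet."""
        return cls(*struct.unpack(_FORMAT, data[:HEADER_SIZE]))


def flow_key(header: CPHeader, source_ip: str) -> tuple:
    """The (application id, IP address, flow id) triple that names a flow."""
    return (header.app_id, source_ip, header.flow_id)
```

The source IP address is passed in separately because it comes from
the enclosing IP header rather than from CP itself.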

The remaining contents of the CP header vary according to the
changing role played by the header as it traverses the network path
from source endpoint to destination endpoint. As the packet passes
from the source endpoint to its local AP, the header merely
identifies the cluster application it is associated with and its
sender. As the packet is sent from the source’s local AP to the
remote AP, the header contains probe information used to measure
round trip time, detect packet loss, and communicate current loss
rate and bandwidth availability. As the packet is forwarded from the
remote AP to its destination endpoint, the header contains
information on application bandwidth use, flow membership, round
trip time, loss rate, and bandwidth availability.


4.3 Basic Operation 


The basic operation of CP is as follows. 


• As packets originate from source endpoints.
  The CP header is included in the application packet, indicating
  the source of the packet and the cluster application it is
  associated with.

• As packets arrive at the local AP.
  CP will process the identification information arriving in the
  CP header, and note the packet's size and arrival time. Part of
  the CP header will then be overwritten, allowing the AP to
  communicate congestion probe information to the remote AP.

• As packets arrive at the remote AP.
  The CP header is processed and used to detect network conditions.
  Again, part of the CP header is overwritten to communicate
  network condition information, along with information on cluster
  application size and bandwidth usage, to the remote endpoint.

• As packets arrive at the destination endpoint.
  CP processes network condition information from the CP header and
  passes it on to the transport-level protocol and the application.


Figure 4: CP packet structure.


4.4 State Maintained by an AP 


An AP maintains a table of active cluster applications, each entry
of which exists as soft state. When a packet arrives with an unknown
cluster identifier in its CP header, a new entry will be created in
the table and CP probe mechanisms will become active for that
application. Similarly, if no CP packet has been seen for a
particular cluster identifier for some time, then the entry will
time out and be removed from the application table. Use of soft
state in this manner is both flexible and lightweight in that it
avoids the need for explicit configuration and ongoing
administration.
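
The soft-state behavior described above can be sketched in a few
lines of Python. This is a minimal illustration, not the AP
implementation: the class name `SoftStateTable`, the `timeout_sec`
parameter, and the injectable clock are assumptions introduced here.

```python
import time


class SoftStateTable:
    """Application table whose entries expire unless refreshed by traffic."""

    def __init__(self, timeout_sec: float, clock=time.monotonic):
        self.timeout_sec = timeout_sec  # assumed configuration parameter
        self.clock = clock              # injectable so expiry can be tested
        self._last_seen = {}            # cluster application id -> last arrival

    def packet_arrived(self, app_id: int) -> bool:
        """Record a CP packet; return True if this created a new entry."""
        is_new = app_id not in self._last_seen
        self._last_seen[app_id] = self.clock()
        return is_new

    def expire(self) -> list:
        """Remove and return entries idle for longer than the timeout."""
        now = self.clock()
        stale = [app_id for app_id, seen in self._last_seen.items()
                 if now - seen > self.timeout_sec]
        for app_id in stale:
            del self._last_seen[app_id]
        return stale

    def active(self) -> set:
        """Identifiers of the cluster applications currently tracked."""
        return set(self._last_seen)
```

No explicit setup or teardown call appears anywhere: entries are
created by the first packet and disappear when traffic stops, which
is the property the text attributes to soft state.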

For each cluster application, the AP monitors the number of
participating flows, and the number and size of packets received
during a given interval. Weighted averages are calculated to dampen
the effect of packet bursts. The information is passed back to local
cluster endpoints using the CP header whenever a packet arrives from
the remote AP en route to a local endpoint. If no such packet
arrives within a specified time period, then a report packet is
created and “pushed” to each endpoint informing them of cluster
application membership and bandwidth usage, as well as current
network conditions.

An AP also maintains probe state, including a current packet
sequence number, estimated round trip time and mean deviation, a
loss history and estimated loss rate, and a bandwidth availability
calculation. Use of these mechanisms is described below.


4.5 Detecting Network Delay and Loss


A primary function of CP is to measure network 
delay and detect packet loss along the cluster-to- 
cluster data path. Figure 5, Table 1, and Table 2 
together illustrate how information in the CP header 
is used to make these measurements. 

Each packet passing from one AP to another has several numbers
inserted into its CP header. The first is a sequence number that
increases monotonically for every packet sent. A remote AP may use
this number to observe gaps (and reordering) in the aggregate flow
of cluster application packets that it receives. In this way, it can
detect losses and infer congestion. In our example, AP2 detects the
loss of packet C when the sequence number received skips from 14
(packet B) to 16 (packet D).
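
A minimal Python sketch of this gap detection follows. The class
name `GapDetector` is hypothetical, and the sketch assumes sequence
numbers never wrap around, which a real header field of fixed width
would have to handle.

```python
class GapDetector:
    """Tracks sequence numbers seen by a receiving AP and reports gaps."""

    def __init__(self):
        self.highest = None   # highest sequence number seen so far
        self.missing = set()  # numbers skipped and not yet filled in

    def receive(self, seq: int) -> list:
        """Return the sequence numbers newly presumed lost on this arrival."""
        if self.highest is None:
            self.highest = seq
            return []
        if seq > self.highest:
            gap = list(range(self.highest + 1, seq))
            self.missing.update(gap)
            self.highest = seq
            return gap
        # A late (reordered) packet fills a hole rather than opening one.
        self.missing.discard(seq)
        return []
```

Receiving 14 and then 16 reports 15 as missing, matching the example
in the text; if 15 later arrives out of order, it is removed from
the missing set.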


In addition, a timestamp is sent along with the se- 
quence number indicating the time at which the AP 
sent the packet. The remote AP will then echo the 
timestamp of the last sequence number received by 
placing the value in the CP header of the next packet 
traveling on the reverse path back to the sending AP. 
Along with this timestamp, a delay value will also 
be given indicating the length of time between the 
arrival of the sequence number at the AP and the 
time the AP transmitted the echo. 

By noting the time when a packet is received ($T_{arrival}$), the AP
can calculate the round trip time as
$(T_{arrival} - T_{echo}) - T_{delay}$. In our example, AP2 receives
packet B at time 280. The CP header contains the timestamp echo 60
and an echo delay value of 30. Thus, the round trip time is
calculated as 280 - 60 - 30 = 190. A weighted average of these round
trip time calculations is used to dampen the effects of burstiness.
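
The calculation can be sketched in Python as follows. The smoothing
weight is not given in the text; the default of 0.125 used here is
the value conventionally used for TCP's smoothed RTT and is only an
assumption, as are the names `rtt_sample` and `SmoothedRTT`.

```python
def rtt_sample(t_arrival: float, t_echo: float, t_delay: float) -> float:
    """Round trip time from one echoed timestamp and its echo delay."""
    return (t_arrival - t_echo) - t_delay


class SmoothedRTT:
    """Exponentially weighted average of successive RTT samples."""

    def __init__(self, weight: float = 0.125):
        self.weight = weight  # share given to each new sample (assumed)
        self.value = None     # no estimate until the first sample arrives

    def update(self, sample: float) -> float:
        """Fold one sample into the running estimate and return it."""
        if self.value is None:
            self.value = sample
        else:
            self.value = (1 - self.weight) * self.value + self.weight * sample
        return self.value
```

With the numbers from the example, `rtt_sample(280, 60, 30)` yields
the round trip time of 190 computed in the text.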

Note that because sequence numbers in the CP header do not have any
transport-level function, CP can use whatever C-to-C application
packet is being transmitted next to carry this information. Since
the packets of multiple flows are available for this purpose, this
mechanism can be used for fine-grained detection of network
conditions along the cluster-to-cluster data path.


Figure 5: Timeline of AP packet exchanges.

Table 1: Information in CP header for packets traveling from AP1 to
AP2 in Figure 5.


We also observe that there is no one-to-one correspondence between
timestamps sent and timestamps echoed between APs. It may be the
case that more than one packet is received by a remote AP before a
packet traveling along the opposite path is available to echo the
most current timestamp. The AP simply makes use of available packets
in a best effort manner. In Figure 5 this can be seen as AP2
receives both packets B and D before packet E is available to send
on the return path. Likewise, an AP may echo the same timestamp more
than once if no new CP packet arrives with a new timestamp. In our
example, this occurs when AP1 sends packets B, C, and D with a
timestamp echo value of 60 which it received from packet A.

Packet   Sequence Number   Timestamp   Timestamp Echo   Echo Delay
A        76                60          620              40
E        77                460         1020             60

Table 2: Information in CP header for packets traveling from AP2 to
AP1 in Figure 5.



4.6 Calculating Loss Rate and Bandwidth Availability


Calculation of loss rate and bandwidth availability makes use of
the equation-based congestion control methods described by Floyd et
al. in their work on TCP-friendly rate control (TFRC) [8].

Loss rate, a central input parameter into the bandwidth availability
equation, is calculated using a loss history and loss events rather
than individual packet losses. By using a loss event rate rather
than a simple lost packet rate, we provide a more stable handling of
lost packet bursts. The reader is referred to [6] for more details.

Calculation of available bandwidth makes use of the equation:

$$X = \frac{s}{R\sqrt{2bp/3} + t_{RTO}\left(3\sqrt{3bp/8}\right)p\left(1 + 32p^2\right)}$$

where $X$ is the transmit rate (bytes/sec), $s$ is the packet size
(bytes), $R$ is the round trip time (sec), $p$ is the loss event
rate on the interval [0, 1.0], $t_{RTO}$ is the TCP retransmission
timeout (sec), and $b$ is the number of packets acknowledged by a
single TCP acknowledgement.
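
For concreteness, the TFRC throughput equation referenced here can
be written as a short Python function using the symbols defined
above. The function name and the treatment of a zero loss event rate
(returning infinity, since the equation is undefined there) are
choices made for this sketch.

```python
from math import sqrt


def tcp_friendly_rate(s: float, R: float, p: float,
                      t_rto: float, b: int = 1) -> float:
    """TCP-friendly transmit rate X in bytes/sec.

    s: packet size (bytes), R: round trip time (sec),
    p: loss event rate in (0, 1], t_rto: retransmission timeout (sec),
    b: packets acknowledged by a single TCP acknowledgement.
    """
    if p <= 0:
        # No observed loss: the equation places no bound on the rate.
        return float("inf")
    denominator = (R * sqrt(2 * b * p / 3)
                   + t_rto * (3 * sqrt(3 * b * p / 8)) * p * (1 + 32 * p ** 2))
    return s / denominator
```

The rate falls as either the loss event rate or the round trip time
rises, which is the behavior that makes the aggregate flow respond
to congestion in the way a single TCP connection would.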

The resulting quantity, which we refer to as cur- 
rent bandwidth availability, is calculated at the re- 
mote AP, and then passed using the CP header to 
each endpoint in the cluster. Similarly, the event 
loss rate is also passed on to endpoints to inform 
them of current network conditions. 

We emphasize here that the use of the above equation to calculate
bandwidth availability for the cluster application makes the
aggregate data flow from one AP to another TCP-compatible.


4.7 Transport-level Protocols


Transport-level protocols at the endpoints are built on top of CP in
the same manner that TCP is built on top of IP. CP provides these
transport-level protocols with a consistent view of network
conditions, including aggregate bandwidth availability, loss rate,
and round trip delay measurements. In addition, it informs endpoints
of the aggregate bandwidth usage and the current number of flows in
the cluster application. A transport-level protocol will in turn use
this information, along with various configuration parameters, to
determine a data transmission rate and related send characteristics.

In Figure 3, we show several possible transport- 
level protocols (C-TCP, C-UDP, and C-RTP) which 
are meant to represent coordinated counterparts to 
existing protocols. A coordinated version of UDP 
(C-UDP) simply makes the above information avail- 
able directly to the application which may modify 
its sending rate according to an application-specific 
rule or bandwidth sharing scheme. 

A coordinated version of TCP (C-TCP) may con- 
sider acknowledgements only as an indicator of suc- 
cessful transfer. The burden of round trip delay de- 
termination and congestion detection can be rele- 
gated entirely to CP. Send rate adjustments at the 
transport level are the combined result of configu- 
ration information given by the application (e.g., a 
maximum sending rate), and information on current 
network conditions as provided by CP. 

While C-UDP and C-TCP represent adaptations 
of familiar transport-level protocols, we believe that 
other coordinated transport-level protocols are pos- 
sible. Such protocols will make use of CP infor- 
mation and application semantics to adjust sending 
rates to meet application-specific objectives. 


4.8 Application-level Programming Interface


Endpoint implementations of CP provide a modified socket interface
to the application layer. With this interface, the application is
able to associate its data flow with a particular cluster
application and interact more directly with CP-related mechanisms in
two ways.

First, the application may use the interface to communicate
configuration information to the transport level. For example, an
application may wish to restrict the transport-level sending rate to
no more than some maximum value. Or, an application may instruct the
transport layer to send at only some fraction of the available
bandwidth given various conditions. Such configuration is made
possible by a set of system calls which allow applications to pass
functions to the transport layer which operate on reported CP values
in order to calculate an instantaneous sending rate.

The application may also use the interface to access CP information
directly. Thus, a system call is provided which allows the
application to query, for example, available bandwidth, round trip
time, and the current loss rate. Obtaining this information directly
is of particular importance when the application itself controls its
own send rate (e.g., C-UDP) rather than relegating such control to
the transport-level protocol (e.g., C-TCP).
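
The two uses of the interface can be illustrated with a small Python
model. The paper does not give the system call names or signatures,
so everything below (`CPInfo`, `CoordinatedEndpoint`,
`set_rate_function`, `query`) is hypothetical and stands in for the
modified socket interface rather than reproducing it.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CPInfo:
    """Values CP reports to an endpoint (field names are assumptions)."""
    available_bw: float  # bandwidth availability hint (bits/sec)
    rtt: float           # round trip time between APs (sec)
    loss_rate: float     # loss event rate
    aggregate_bw: float  # bandwidth used by the whole application
    num_flows: int       # flows currently in the cluster application


RateFunction = Callable[[CPInfo], float]


class CoordinatedEndpoint:
    """Stand-in for an endpoint's coordinated transport socket."""

    def __init__(self, app_id: int):
        self.app_id = app_id
        self._info: Optional[CPInfo] = None
        # Default policy: send at whatever CP reports as available.
        self._rate_fn: RateFunction = lambda info: info.available_bw

    def set_rate_function(self, fn: RateFunction) -> None:
        """First use: hand the transport layer a function of CP values."""
        self._rate_fn = fn

    def deliver(self, info: CPInfo) -> None:
        """Invoked when CP values arrive in an incoming header."""
        self._info = info

    def query(self) -> Optional[CPInfo]:
        """Second use: let the application read CP values directly."""
        return self._info

    def send_rate(self) -> float:
        """Instantaneous sending rate under the configured function."""
        if self._info is None:
            raise RuntimeError("no CP information received yet")
        return self._rate_fn(self._info)
```

An application wishing to send at half the available bandwidth, but
never more than 2 Mb/s, could pass
`lambda info: min(2e6, 0.5 * info.available_bw)` to
`set_rate_function`.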


4.9 Endpoint Coordination 


While a goal of C-to-C applications is to maintain congestion
responsiveness on an aggregate level, how this goal is realized is
left entirely to the application. The approach of CP is to avoid the
use of traffic shaping or packet scheduling mechanisms at the AP,
but instead to provide application endpoints with bandwidth
availability “hints” and other information about changing network
conditions. An application may then apportion bandwidth among
endpoints by configuring them to respond to these hints in ways
which meet the objectives of the application as a whole.

For example, a C-to-C application may config- 
ure secondary streaming endpoints to reduce their 
sending rate, or stop sending altogether, in response 
to a drop in available bandwidth below a particu- 
lar threshold value. At the same time, a primary 
stream endpoint may continue to send at its original 
rate, and a control endpoint may increase its send- 
ing rate somewhat in order to transmit important 
commands telling the receive side how to respond 
to the change. Despite these differences in response 
behavior, the aggregate bandwidth usage drops ap- 
propriately to match the bandwidth availability hint 
given. 
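
One way such a configuration might look is sketched below in Python.
The function name, the 1.5 boost applied to the control endpoint,
and the rule that secondary streams stop completely below the
threshold are illustrative assumptions, not policies prescribed by
CP; rates are in arbitrary but consistent units.

```python
def coordinated_rates(available: float, threshold: float, primary: float,
                      control: float, secondaries: list) -> dict:
    """Per-endpoint rates whose sum never exceeds the `available` hint."""
    congested = available < threshold
    rates = {}
    # Control traffic is small; raise it under congestion so commands
    # describing the change reach the receive side (factor is assumed).
    rates["control"] = min(available, control * (1.5 if congested else 1.0))
    # The primary stream keeps its original rate if the hint allows it.
    rates["primary"] = min(primary, available - rates["control"])
    leftover = available - rates["control"] - rates["primary"]
    for i, wanted in enumerate(secondaries):
        # Secondary streams stop altogether below the threshold.
        rate = 0.0 if congested else min(wanted, leftover)
        rates[f"secondary{i}"] = rate
        leftover -= rate
    return rates
```

Each endpoint responds differently to the same hint, yet the total
stays within it, which is the aggregate behavior described in the
text.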

CP provides a C-to-C application with the mechanisms needed to make
coordinated adaptation decisions which reflect the current state of
the network and the application’s objectives. We believe it
unnecessary to provide additional mechanisms which enforce bandwidth
usage among endpoints since each belongs to the same application and
thus shares the same objectives. In addition, endpoint configuration
may be complex and change dynamically, making the implementation of
an enforcement scheme inherently problematic.


5 Evaluation 


In this section, we evaluate the behavior of CP us- 
ing the network simulator ns-2 [4]. We focus here 
on our implementation of C-TCP, the coordinated 
counterpart to TCP. 

C-TCP, like TCP, implements reliability through the use of
acknowledgement packets, timeouts, and retransmission. Unlike
window-based TCP, however, C-TCP is a rate-based implementation
which adjusts its instantaneous send rate based on bandwidth
availability information supplied by CP, and configuration
information supplied by the application. Our implementation of C-TCP
draws heavily from TFRC [8], except that loss and send rate
calculations are handled by APs communicating over the C-to-C data
path, and TCP-compatibility, as defined in [6], is achieved on an
aggregate and not per-flow level.

Figure 6: Simulation testbed in ns-2.


5.1 Network Topology 


Our simulation topology is pictured in Figure 6. A cluster of
sending agents is labeled $S_1$ through $S_n$, with its local
aggregation point labeled $AP_S$. A remote cluster of ACK
(acknowledgement) agents is labeled $A_1$ through $A_n$, with its
aggregation point labeled $AP_A$. $I_1$ and $I_2$ are intermediary
nodes used to create a congested link, and $T_1$ and $T_2$ are used
for traffic generation.

Propagation delay on links $AP_S$-$I_1$, $I_1$-$I_2$, and
$I_2$-$AP_A$ is configured to be 4 msec, while it is only 1 msec on
links $S_i$-$AP_S$ and $A_i$-$AP_A$. The link capacity for all links
is 10 Mb/s, except for links $T_1$-$I_1$ and $T_2$-$I_2$ where link
capacity is 100 Mb/s. This allows traffic generators to increase
traffic over link $I_1$-$I_2$ to any desired level.

Trace data is collected as it is transmitted from $AP_S$ to $I_1$
since this allows us to observe sending rates before additional
traffic on the link $I_1$-$I_2$ causes queuing delays, drops, or
jitter not reflective of cluster endpoint sending rates.

TCP and C-TCP flows in this section use an infinitely large data
source and send at the maximum rate allowed by their respective
algorithms. Congestion periods are created by configuring $T_1$ and
$T_2$ to generate constant bitrate traffic across the link
$I_1$-$I_2$. In particular, a CBR agent sending at a constant
7.5-9.0 Mb/s from $T_1$ to $T_2$ competes with data traffic from
$S_1$-$S_n$ over link $I_1$-$I_2$.

Figure 7: TCP flows competing for bandwidth during congestion.


5.2 Behavior of Uncoordinated TCP Flows


To better see the problem addressed by CP, we first 
examine how several TCP connections behave with- 
out coordination. In Figure 7, we see the throughput 
plot of three TCP connections as network congestion 
occurs between time 8.0 and 13.0 seconds. Flow 0 
belongs to an application process with higher band- 
width requirements than processes associated with 
flows 1 and 2. This can be seen clearly at the right 
and left edges of the plot when flow 0 takes its full 
share of the bandwidth under congestion-free cir- 
cumstances. 
We note the following observations: 


• During the congestion interval, all three flows compete with one
  another and receive a roughly similar portion of the available
  bandwidth.

• The flows continue to compete in a similar fashion during the
  period directly afterward (time 13.0 through 22.0) as each
  struggles to send accumulated data and regain its requisite level
  of bandwidth.

• The bandwidth used by each flow is characterized by jagged edges,
  often criss-crossing one another. This makes sense since each flow
  operates independently, searching the bandwidth space by
  repeatedly ramping up and backing off.


5.3 Behavior of C-TCP Flows 


We postulate here that use of the Coordination Protocol (CP) should
be distinctive in at least two ways. First, since all flows make use
of the same bandwidth availability calculation, round trip time, and
loss rate information, bandwidth usage patterns among CP flows
should be much smoother. That is, there should be far fewer jagged
edges and less criss-crossing of individual flow bandwidths as flows
need not search the bandwidth space in isolation for a maximal send
rate.

Figure 8: C-TCP flows sharing bandwidth equally.

Second, the use of bandwidth by a set of CP flows should reflect the
priorities and configuration of the application, including during
intervals of congestion when network resources become limited.

To test these hypotheses, we implemented three simple bandwidth
sharing schemes which reflect different objectives an application
may wish to achieve on an aggregate level. We note here that more
schemes are possible, and the mixing of schemes in complex,
application-specific ways is an open area of research.

Figure 8 shows a simple equal bandwidth sharing scheme in which
C-TCP flows divide available bandwidth ($B$) equally among
themselves ($R_i = B/N$, where $R_i$ is the send rate for sending
endpoint $i$, and $N$ is the number of sending endpoints). The
aggregate plot line shows the total bandwidth used by the multi-flow
application at a given time instant. While not plotted on the same
graph, this line closely corresponds to bandwidth availability
values calculated by APs and communicated to cluster endpoints.

Figure 8 confirms our hypothesis that usage patterns among CP flows
should be far smoother, and avoid the jagged criss-crossing effect
seen in Figure 7. This is both because flows are not constantly
trying to ramp up in search of a maximal sending rate, and because
of the use of weighted averages in the bandwidth availability
calculation itself. The latter has the effect of dampening jumps in
value from one instant to the next.

Figure 9 shows a proportional bandwidth sharing scheme among C-TCP
flows. In this particular scheme, flow 0 is configured to take 0.5
of the bandwidth ($R_0 = 0.5 B$), while flows 1 and 2 evenly divide
the remaining portion for a value of 0.25 each
($R_1 = R_2 = 0.25 B$).

Figure 9: C-TCP flows sharing bandwidth proportionally.

Figure 9 confirms our second hypothesis above by showing sustained
proportional sharing throughout the entire time interval. This
includes the congestion intervals (times 5.0-8.0 and 14.0-20.0) and
post-congestion intervals (times 8.0-10.0, 20.0-25.0) when TCP
connections might still contend for bandwidth.

In Figure 10, we see a constant bandwidth flow in conjunction with
two flows equally sharing the remaining bandwidth. The former is
configured to send at a constant rate of 3.5 Mb/s or, if it is not
available, at the bandwidth availability value for that given
instant ($R_0 = \min(3.5\ \text{Mb/s}, B)$). Flows 1 and 2 split the
remaining bandwidth or, if none is available, send at a minimum rate
of 1 Kb/s ($R_1 = R_2 = \max((B - R_0)/2,\ 1\ \text{Kb/s})$).

We observe that flows 1 and 2 back off their sending rate almost
entirely whenever flow 0 does not receive its full share of
bandwidth. We also note that while flow 0 is configured to send at a
constant rate, it never exceeds available bandwidth limitations
during times of congestion.
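
The three sharing rules used in Figures 8 through 10 are simple
enough to state as code. The Python sketch below restates the
formulas given in the text; the function names are ours, and rates
are expressed in bits per second.

```python
def equal_share(B: float, n: int) -> list:
    """Figure 8 scheme: every flow sends at B / N."""
    return [B / n] * n


def proportional_share(B: float, weights: list) -> list:
    """Figure 9 scheme: flow i sends at its configured fraction of B."""
    return [w * B for w in weights]


def constant_plus_remainder(B: float, constant: float = 3.5e6,
                            floor: float = 1e3) -> list:
    """Figure 10 scheme: one constant-rate flow, two sharing the rest."""
    r0 = min(constant, B)              # flow 0 never exceeds the hint
    rest = max((B - r0) / 2, floor)    # flows 1 and 2 keep a 1 Kb/s minimum
    return [r0, rest, rest]
```

Note that in the third scheme the minimum rate means the aggregate
can exceed $B$ by up to 2 Kb/s when flow 0 consumes the entire hint,
a small overshoot that follows directly from the formula in the
text.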

We emphasize once again the impossibility of 
achieving results like Figure 9 and Figure 10 in an 
application without the transport-level coordination 
provided by CP. 


5.4 TCP-Friendliness 


The TCP-friendliness of aggregate CP traffic is established by using
the equation-based congestion control method described in [6] and
used by TFRC [8].

Figure 10: A constant bandwidth C-TCP flow with two C-TCP flows
sharing remaining bandwidth.

While equation-based rate control guarantees 
TCP-compatibility over long time intervals, Fig- 
ure 11 illustrates informally the behavior of a sin- 
gle C-TCP connection with two TCP connections 
during a short congested interval (time 5.0 through 
9.0). Here we're interested in verifying that the be- 
havior of the C-TCP flow does indeed appear to be 
compatible with that of the TCP flows. 

In general, we see that the C-TCP connection 
mixes reasonably well with the TCP connections, 
receiving approximately an equal share of the avail- 
able bandwidth. In addition, we once again observe 
the smoothness of its rate adjustments compared to 
the far more volatile changes in TCP flows. 


6 Future Work 


We believe transport-level protocol coordination in C-to-C
applications to be a fertile area for future work. In particular,
much work remains to be done on new transport protocols better
equipped to make use of network condition and cluster flow
information. These protocols may provide end-to-end semantics which
are more specific to an application's needs than current all-purpose
protocols like TCP and UDP.

Flow coordination in a C-to-C application within this paper has
meant the sharing of bandwidth from a single bandwidth availability
calculation, equivalent to a single TCP-compatible flow. Future work
might focus on sharing the equivalent of more than one
TCP-compatible flow, just as many applications (e.g., Web browsers)
open more than one connection to increase throughput by
parallelizing end-to-end communication.

Figure 11: C-TCP flow interacting with TCP flows.

The assumption that local networks on each end of a C-to-C
application can always be provisioned to minimize network delay and
loss may not always be true. For example, wireless devices may
introduce delay and loss inherent to the technology itself. How CP
can be adapted to accommodate this situation is an area of future
work. One idea is to use CP for distinguishing between congestion
sources. End-to-end estimates of delay and loss could be compared
with those of CP in order to determine whether congestion is local
or within the network.

Finally, the impact of CP mechanisms on forwarding performance at
the AP is an important issue that deserves further study. We
conjecture here that the impact will be modest since per-packet
processing largely amounts to simple accounting and checksum
computations, and an AP avoids entirely the need for buffering or
scheduling mechanisms. An actual implementation is required,
however, before any meaningful analysis can be done.


7 Summary 


In this paper, we have identified a class of distributed
applications known as cluster-to-cluster (C-to-C) applications. Such
applications have semantically related flows that share a common
intermediary path, typically between first- and last-hop routers.
C-to-C applications require transport-level coordination to better
put the application in control over bandwidth usage, especially
during periods when network resources become limited by congestion.
Without coordination, high-priority flows may contend equally with
low-priority flows for bandwidth, or receive no bandwidth at all,
thus preventing the application from meeting its objectives
entirely.

We have proposed the Coordination Protocol (CP) as a way of
coordinating semantically related flows in application-controlled
ways. CP operates between the network (IP) and transport (TCP, UDP)
layers, offering C-to-C flows fine-grained information about network
conditions along the cluster-to-cluster data path, as well as
information about application flows as an aggregate. In particular,
CP makes use of equation-based rate control methods to calculate
bandwidth availability for the entire C-to-C application. This
results in aggregate flow rates that are highly adaptive to changing
network conditions and TCP-compatible.


Abstract


Modern high-end disk arrays often have several gigabytes of cache
RAM. Unfortunately, most array caches use management policies which
duplicate the same data blocks at both the client and array levels
of the cache hierarchy: they are inclusive. Thus, the aggregate
cache behaves as if it was only as big as the larger of the client
and array caches, instead of as large as the sum of the two.
Inclusiveness is wasteful: cache RAM is expensive.


We explore the benefits of a simple scheme to achieve exclusive
caching, in which a data block is cached at either a client or the
disk array, but not both. Exclusiveness helps to create the effect
of a single, large unified cache. We introduce a DEMOTE operation to
transfer data ejected from the client to the array, and explore its
effectiveness with simulation studies. We quantify the benefits and
overheads of demotions across both synthetic and real-life
workloads. The results show that we can obtain useful, sometimes
substantial, speedups.


During our investigation, we also developed some new 
cache-insertion algorithms that show promise for multi- 
client systems, and report on some of their properties. 


1 Introduction 


Disk arrays use significant amounts of cache RAM to improve
performance by allowing asynchronous read-ahead and write-behind,
and by holding a pool of data that can be re-read quickly by
clients. Since the per-gigabyte cost of RAM is much higher than of
disk, cache can represent a significant portion of the cost of
modern arrays. Our goal here is to see how best to exploit it.


The cache sizes needed to accomplish read-ahead and write-behind are
typically tiny compared to the disk capacity of the array.
Read-ahead can be efficiently handled with buffers whose size is
only a few times the track size of the disks. Write-behind can be
handled with buffers whose size is large enough to cover the
variance (burstiness) in the write workload [32, 39], since the
sustained average transfer rate is bounded by what the disks can
support: everything eventually has to get to stable storage.
Overwrites in the write-behind cache can increase the front-end
write traffic supported by the array, but do not intrinsically
increase the size of cache needed.


Unfortunately, there is no such simple bound for the size of the
re-read cache: in general, the larger the cache, the greater the
benefit, until some point of diminishing returns is reached. The
common rule of thumb is to try to cache about 10% of the active
data. Table 1 suggests that this is a luxury out of reach of even
the most aggressive cache configurations if all the stored data were
to be active. Fortunately, this is not usually the case: a study of
UNIX file system workloads [31] showed that the mean working set
over a 24 hour period was only 3-7% of the total storage capacity,
and the 90th percentile working set was only 6-16%. A study of
deployed HP AutoRAID systems [43] found that the working set rarely
exceeded the space available for RAID1 storage (about 10% of the
total storage capacity).


Both array and client re-read caches are typically operated
using the least-recently-used (LRU) cache replacement
policy [11, 12, 35]; even though many proprietary
tweaks are used in array caches, the underlying algorithm
is basically LRU [4]. Similar approaches are the
norm in client-server file system environments [15, 27].


Interactions between the LRU policies at the client and
array cause the combined caches to be inclusive: the array
(lower-level) cache duplicates data blocks held in the
client (upper-level) cache, so that the array cache
provides little re-read benefit until it exceeds the effective
size of the client caches.


High-end arrays

System       | Cache   | Disk space
EMC 8830     | 64 GiB  | 70 TB
IBM ESS      | 32 GiB  | 27 TB
HP XP512     | 32 GiB  | 92 TB

High-end servers

System       | Memory  | Type (CPUs)
IBM 2900     | ?       | High-end (1-16)
Sun E10000   | ?       | High-end (4-64)
HP Superdome | 128 GiB | High-end (8-64)
HP rp8400    | ?       | Mid-range (2-16)
HP rp7400    | ?       | Mid-range (2-8)

Table 1: Some representative maximum-supported sizes for
disk arrays and servers from early 2002. 1 GiB = $2^{30}$ bytes.


Inclusiveness is wasteful: it renders a chunk of the array
cache similar in size to the client caches almost useless.
READ operations that miss in the client are more likely
to miss in the array and incur a disk access penalty. For
example, suppose we have a client with 16 GB of cache
memory connected to a disk array with 16 GB of re-read
cache, and suppose the workload has a total READ
working set size of 32 GB. (This single-client, single-array
case is quite common in high-end computer installations;
with multiple clients, the effective client cache
size is equal to the amount of unique data that the clients'
caches hold, and the same arguments apply.) We might
naively expect the 32 GB of available memory to capture
almost all of the re-read traffic, but in practice it would
capture only about half of it, because the array cache will
duplicate blocks that are already in the client [15, 27].


To avoid these difficulties, it would be better to arrange 
for the combined client and array caches to be exclusive, 
so that data in one cache is not duplicated in the other. 


1.1 Exclusive caching 


Achieving exclusive caching requires that the client and 
array caches be managed as one. Since accesses to 
the client cache are essentially free, while accesses to 
the array cache incur the round-trip network delay, the 
cost of an I/O operation at the client, and the controller 
overheads at the array, we can think of this setup as a 
cache hierarchy, with the array cache at the lower level. 
These costs are not large: modern storage area networks
(SANs) provide 1-2 Gbit/s of bandwidth per link, and
I/O overheads of a few hundred microseconds; thus,
retrieving a 4 KB data block can take as little as 0.2 ms.


However, it would be impractical to rewrite client O/S
and array software to explicitly manage both caches. It
would also be undesirable for the array to keep track of
precisely which blocks are in the client, since this metadata
is expensive to maintain. However, we can approximate
the desired behavior by arranging that the client
(1) tells the array when it changes what it caches, and
(2) returns data ejected from the upper-level cache to the
lower-level one, rather than simply discarding it.


We achieve the desired behavior by introducing a DEMOTE
operation, which one can think of as a possible
extension to the SCSI command set. DEMOTE works as
follows: when a client is about to eject a clean block
from its cache (e.g., to make space for a READ), it first
tries to return the block to the array using a DEMOTE.
A DEMOTE operation is similar to a WRITE operation:
the array tries to put the demoted block into its re-read
cache, ejecting another block if necessary to make space.


Figure 1: Sample cache management schemes. The top and
bottom boxes represent the client and array cache replacement
queues respectively. The arrow in a box points to the end
closest to being discarded.


Unlike a WRITE, the array short-circuits the operation
(i.e., it does not transfer the data) if it already has a copy
of the block cached, or if it cannot immediately make
space for it. In all cases, the client then discards the
block from its own cache.
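

The array-side handling of a DEMOTE described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the class name `ArrayCache`, the `fetch_data` callback (standing in for the SAN transfer) and the `can_evict` flag (standing in for "cannot immediately make space") are all assumptions introduced here.

```python
from collections import OrderedDict


class ArrayCache:
    """Toy array re-read cache; the first key is the next block to be ejected."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def demote(self, block_id, fetch_data, can_evict=True):
        """Handle a DEMOTE request; return True if block data was transferred.

        fetch_data is a hypothetical callable that pulls the block over the
        SAN. It is only invoked when the array accepts the block.
        """
        if block_id in self.blocks:
            return False                      # already cached: short-circuit
        if len(self.blocks) >= self.capacity:
            if not can_evict:
                return False                  # no space right now: short-circuit
            self.blocks.popitem(last=False)   # eject the block nearest discard
        self.blocks[block_id] = fetch_data()  # the only path that moves data
        return True


cache = ArrayCache(capacity=2)
print(cache.demote(1, lambda: b"block-1"))   # accepted, data transferred
print(cache.demote(1, lambda: b"block-1"))   # duplicate, short-circuited
```

In either outcome the client drops its own copy afterwards, so the sketch only needs to report whether a transfer took place.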


Clients are trusted to return the same data that they read 
earlier. This is not a security issue, since they could eas- 
ily issue a WRITE to the same block to change its con- 
tents. If corruption is considered a problem, the array 
could keep a cryptographic hash of the block and com- 
pare it with a hash of the demoted block, at the expense 
of more metadata management and execution time. 


SANs are fast and disks are slow, so though a DEMOTE 
may incur a SAN block transfer, performance gains are 
still possible: even small reductions in the array cache 
miss rate can achieve dramatic reductions in the mean 
READ latency. Our goal is to evaluate how close we can 
get to this desirable state of affairs and the benefits we 
obtain from it. 


1.2 Exclusive caching schemes 


The addition of a DEMOTE operation does not in itself
yield exclusive caching: we also need to decide what
the array cache does with blocks that have just been demoted
or read from disk. This is primarily a choice of
cache replacement policy. We consider three combinations
of demotions with different replacement policies at
the array, illustrated in figure 1; all use the LRU policy
at the client:


• NONE-LRU (the baseline scheme): clients do no demotions;
the array uses the LRU replacement policy
for both demoted and recently read blocks.


• DEMOTE-LRU: clients do demotions; the array uses
the traditional LRU cache management for both demoted
and recently read blocks.


• DEMOTE: clients do demotions; the array puts
blocks it has sent to a client at the head (closest to
being discarded end) of its LRU queue, and puts demoted
blocks at the tail. This scheme most closely
approximates the effect of a single unified LRU
cache.


We observe that the DEMOTE scheme is more exclusive
than the DEMOTE-LRU scheme, and so should result in
lower mean latencies. Consider what happens when a
client READ misses in the client and array caches, and
thus provokes a back-end disk read. With DEMOTE-LRU,
the client and array will double-cache the block until
enough subsequent READs miss and push it out of one
of the caches (which will take at least as many READs as
the smaller of the client and array queue lengths). With
DEMOTE, the double-caching will last only until the
next READ that misses in the array cache. We thus expect
DEMOTE to be more exclusive than DEMOTE-LRU,
and so to result in lower mean READ latencies.
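

The difference between the three schemes can be made concrete with a small two-level cache simulation. The sketch below is not the authors' simulator. The function name `simulate` and its parameters are hypothetical, and it makes two assumptions the text does not spell out: a short-circuited DEMOTE still repositions the block at the array's tail, and a block read from disk always occupies an array slot, even when it is placed at the discard end. The cyclic trace mirrors the sequential example, with a working set one block smaller than the two caches combined.

```python
from collections import OrderedDict


def simulate(reads, n_client, n_array, scheme, warmup=0):
    """Return (client_hit_rate, array_hit_rate) over the timed READs.

    scheme is "NONE-LRU", "DEMOTE-LRU" or "DEMOTE". In both ordered
    dicts the first key is the block closest to being discarded.
    """
    client, array = OrderedDict(), OrderedDict()
    client_hits = array_hits = timed = 0

    def array_put(block, discard_end):
        # Place a block in the array cache, ejecting one block if it is full.
        if block in array:
            array.move_to_end(block, last=not discard_end)
            return
        if len(array) >= n_array:
            array.popitem(last=False)
        array[block] = True
        if discard_end:
            array.move_to_end(block, last=False)

    for i, block in enumerate(reads):
        counted = i >= warmup
        timed += counted
        if block in client:
            client.move_to_end(block)
            client_hits += counted
            continue
        if len(client) >= n_client:
            victim, _ = client.popitem(last=False)
            if scheme != "NONE-LRU":
                array_put(victim, discard_end=False)   # DEMOTE to the tail
        if block in array:
            array_hits += counted
        # DEMOTE parks blocks just sent to the client at the discard end;
        # the LRU-based schemes keep them at the tail.
        array_put(block, discard_end=(scheme == "DEMOTE"))
        client[block] = True
    return client_hits / timed, array_hits / timed


n = 64                      # blocks per cache (illustrative size)
n_seq = 2 * n - 1           # working set: both caches minus one block
trace = [i % n_seq for i in range(11 * n_seq)]
for scheme in ("NONE-LRU", "DEMOTE-LRU", "DEMOTE"):
    h_c, h_a = simulate(trace, n, n, scheme, warmup=n_seq)
    print(f"{scheme:10s} client={h_c:.0%} array={h_a:.0%}")
```

Under these assumptions only DEMOTE turns the array into an extension of the client cache for the cyclic trace; the two LRU-based array policies keep holding blocks that will not be requested next.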


1.3 Objectives


To evaluate the performance of our exclusive caching ap- 
proach, we aim to answer the following questions: 


1. Do demotions increase array cache hit rates in 
single-client systems? 


2. If so, what is the overall effect of demotions on
mean latency? In particular, do the costs exceed
the benefits? Costs include extra SAN transfers, as
well as delays incurred by READs that wait for DEMOTEs
to finish before proceeding.


3. How sensitive are the results to variations in SAN 
bandwidth? 


4. How sensitive are the results to the relative sizes of 
the client and array caches? 


5. Do demotions help when an array has multiple 
clients? 


The remainder of the paper is structured as follows. We 
begin with a demonstration of the potential benefits of 
exclusive caching using some simple examples. We then 
explore how well it fares on more realistic workloads 
captured from real systems, and show that DEMOTE does 
indeed achieve the hoped-for benefits. 


Multi-client exclusive caching represents a more challenging
target, and we devote the remainder of the paper
to an exploration of how this can be achieved, including
a new way of thinking about cache insertion policies.
After surveying related work, we end with our observations
and conclusions.


2 Why exclusive caching? 


In this section, we explore the potential benefits of ex- 
clusive caching in single-client systems, using a simple 
analytical performance model. We show that exclusive 
caching has the potential to double the effective cache 
size with client and array caches of equal size, and that 
the potential speedups merit further investigation. 


We begin with a simple performance model for estimating
the costs and benefits of caching. We predict the
mean latency seen by a client application as


$$T_{mean} = T_c h_c + (T_a + T_c) h_a + (T_d + T_a + T_c)\,\mathit{miss} \qquad (1)$$


where $T_c$ and $T_a$ are the costs of a hit in the client and disk
array caches respectively, $T_d$ is the cost of reading a block
from disk (since such a block is first read into the cache,
and then accessed from there, it also incurs $T_a + T_c$), $h_c$
and $h_a$ are the client and array cache hit rates respectively
(expressed as fractions of the total client READs),
and $\mathit{miss} = 1 - (h_c + h_a)$ is the miss rate (the fraction of
all READs that must access the disk). Since $T_c \approx 0$,


$$T_{mean} \approx T_a h_a + (T_a + T_d)\,\mathit{miss} \qquad (2)$$


In practice, $T_a$ is much less than $T_d$: $T_a \approx 0.2$ ms and
$T_d \approx 4$-$10$ ms for non-sequential 4 KB reads.


We must also account for the cost of demotions. Large
demotions will be dominated by data transfer times,
small ones by array controller and host overheads. If
we assume that a DEMOTE costs the same as a READ
that hits in the array, and that clients demote a block for
every READ, then we can approximate the cost of demotions
by doubling the latency of array hits. This is
an upper bound, since demotions transfer no data if they
abort, e.g., if the array already has the data cached. With
the inclusion of demotion costs,


$$T_{mean} \approx 2 T_a h_a + (2 T_a + T_d)\,\mathit{miss} \qquad (3)$$


We now use our model to explore some simple examples,
setting $T_a$ = 0.2 ms and $T_d$ = 10 ms throughout this
section.


2.1 Random workloads 


Consider first a workload with a spatially uniform distribution
of requests across some working set (also known
as random). We expect that a client large enough to hold
half of the working set would achieve $h_c$ = 50%. An array
with inclusive caching duplicates the client contents,
and would achieve no additional hits, while an array with
exclusive caching should achieve $h_a$ = 50%.


[Plot: cumulative hit rate vs. effective cache size (MB) for the ZIPF workload; curves: Inclusive, Exclusive]

Figure 2: Cumulative hit rate vs. effective cache size for a
Zipf-like workload, with client and array caches of 64 MB each
and a working set size of 128 MB. The marker shows the additional
array hit rate achieved with exclusive caching.


Equations 2 and 3 predict that the change from inclusive
to exclusive caching would reduce the mean latency
from $0.5(T_a + T_d)$ to $T_a$, i.e., from 5.1 ms to 0.2 ms.
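

As a quick check of these figures, the approximations in equations 2 and 3 can be evaluated directly. The helper below is only a sketch: the name `mean_latency` and its keyword arguments are hypothetical, and the defaults are the section's values of $T_a$ = 0.2 ms and $T_d$ = 10 ms.

```python
def mean_latency(h_a, miss, t_a=0.2, t_d=10.0, demotions=False):
    """Approximate mean READ latency in ms, taking the client hit cost as zero.

    Without demotions this is equation 2; with demotions every array
    access is charged twice, which gives equation 3.
    """
    k = 2 if demotions else 1
    return k * t_a * h_a + (k * t_a + t_d) * miss


# Random workload, client holds half the working set (h_c = 50%).
inclusive = mean_latency(h_a=0.0, miss=0.5)                  # array adds no hits
exclusive = mean_latency(h_a=0.5, miss=0.0, demotions=True)  # array catches the rest
print(f"inclusive: {inclusive:.1f} ms, exclusive: {exclusive:.1f} ms")
```

The inclusive case is dominated by the disk term, while the exclusive case pays only the doubled array-hit cost.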


2.2 Zipf workloads 


Even workloads that achieve high client hit rates may
benefit from exclusive caching. An example of such
a workload is one with a Zipf-like distribution [49],
which approximates many common access patterns: a
few blocks are frequently accessed, others much less often.
This is formalized as setting the probability of a
READ for the $i^{th}$ block proportional to $1/i^{\alpha}$, where $\alpha$ is
a scaling constant commonly set to 1.


Consider the cumulative hit rate vs. effective cache size
graph shown in figure 2 for the Zipf workload with a
128 MB working set. A client with a 64 MB cache
will achieve $h_c$ = 91%. No additional hits would occur
in the array with a 64 MB cache and traditional, fully
inclusive caching. Exclusive caching would allow the
same array to achieve an incremental $h_a$ = 9%; because
$T_a \ll T_d$, even small decreases in the miss rate can
yield large speedups. Equations 2 and 3 predict mean
READ latencies of 0.918 ms and 0.036 ms for inclusive
and exclusive caching respectively, an impressive
25.5x speedup.
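

The arithmetic behind these two latencies, written out with $T_a$ = 0.2 ms and $T_d$ = 10 ms. The subscripts "incl" and "excl" are labels added here for the two cases:

```latex
\begin{align*}
\text{Inclusive (eq. 2), } h_a = 0,\ \mathit{miss} = 0.09:\quad
  T_{incl} &\approx (T_a + T_d)\,\mathit{miss} = (0.2 + 10)(0.09) = 0.918\ \text{ms} \\
\text{Exclusive (eq. 3), } h_a = 0.09,\ \mathit{miss} = 0:\quad
  T_{excl} &\approx 2\,T_a h_a = 2(0.2)(0.09) = 0.036\ \text{ms} \\
\text{Speedup:}\quad
  T_{incl} / T_{excl} &= 0.918 / 0.036 = 25.5
\end{align*}
```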


3 Single-client synthetic workloads 


In this section, we explore the effects of exclusive
caching using simulation experiments with synthetic
workloads. Our goal is to confirm the intuitive arguments
presented in section 2, as well as to conduct sensitivity
analyses for how our demotion scheme responds
to variations in the client-array SAN bandwidth and relative
client and array cache sizes. Sections 4 and 5 present
our results for real-life workloads.


Figure 3: System simulated for the single-client workloads,
with a RAID5 array and a 1 Gbit/s FibreChannel SAN.


3.1 Evaluation environment: Pantheon 

To evaluate our cache management schemes, we began
by using the Pantheon simulator [44], which includes
calibrated disk models [33]. Although the Pantheon array
models have not been explicitly calibrated, Pantheon
has been used successfully in design studies of the HP
AutoRAID disk array [45], so we have confidence in its
predictive powers.


We configured Pantheon to model a RAID5 disk array
connected to a single client over a 1 Gbit/s FibreChannel
link, as shown in figure 3. For these experiments, we
used a workload with 4 KB READs, and set $T_a$ = 0.2 ms;
the Pantheon disk models gave $T_d \approx 10$ ms.


The Pantheon cache models are extremely detailed,
keeping track of I/O operations in 256 byte size units
in order to model contention effects. Unfortunately, this
requires large amounts of memory, and restricted us to
experiments with only 64 MB caches. With a 4 KB cache
block size, this means that the client and array caches
were restricted to $N_c = N_a = 16384$ blocks in size.


To eliminate resource-contention effects for our synthetic
workload results, we finished each READ before
starting the next. In each experiment, we first "warmed
up" the caches with a working-set-sized set of READs; the
performance of these READs is not included in the results.
Latency variances were all below 1%.


Our chief metric for evaluating the exclusive caching
schemes is the mean latency of a READ at the client; we
also report on the array cache hit rate. For each result,
we present both absolute latencies and a speedup ratio,
which is the baseline (NONE-LRU) mean latency divided
by the mean latency for the current experiment. Although
the difficulties of modeling partially closed-loop
application behavior are considerable [16], a purely I/O-bound
workload should see its execution time reduced
by the speedup ratio.


Workload | Client | NONE-LRU | DEMOTE-LRU | DEMOTE
RANDOM   | 50%    | 8%       | 21%        | 46%
SEQ      | 0%     | 0%       | 0%         | 100%
ZIPF     | 86%    | 2%       | ?          | 9%

Table 2: Client and array cache hit rates for single-client synthetic
workloads. The client hit rates are the same for all the
demotion variants, and can be added to the array hit rates to get
the total cache hit rates.


Workload | NONE-LRU | DEMOTE-LRU      | DEMOTE
RANDOM   | 4.77 ms  | 3.43 ms (1.39x) | 0.64 ms (7.5x)
SEQ      | 1.67 ms  | 1.91 ms (0.87x) | 0.48 ms (3.5x)
ZIPF     | 1.41 ms  | 1.19 ms (1.18x) | 0.85 ms (1.7x)

Table 3: Mean READ latencies and speedups over NONE-LRU
for single-client synthetic workloads.


3.2 The RANDOM synthetic workload 


For this test, the workload consisted of one-block READs
uniformly selected from a working set of $N_{rand}$ blocks.
Such random access patterns are common in on-line
transaction-processing workloads (e.g., TPC-C, a classic
OLTP benchmark [38]).


We set the working set size to the sum of the client and
array cache sizes: $N_c = N_a = 16384$, $N_{rand} = 32768$
blocks, and issued $N_{rand}$ warm-up READs, followed by
$10 \times N_{rand}$ timed READs.


We expected that the client would achieve $h_c$ = 50%.
Inclusive caching would result in no cache hits at the
array, while exclusive caching should achieve an additional
$h_a$ = 50%, yielding a dramatic improvement in
mean latency.


The results in table 2 validate our expectations. The
client achieved a 50% hit rate for both inclusive and exclusive
caching, and the array with DEMOTE achieved an
additional 46% hit rate. 4% of READs still missed with
DEMOTE, because the warm-up READs did not completely
fill the client cache. Also, since NONE-LRU is not fully
inclusive (as previous studies demonstrate [15]), the array
with NONE-LRU still achieved an 8% hit rate.


As predicted in section 1.2, DEMOTE-LRU did not perform
as well as DEMOTE. DEMOTE-LRU only achieved
$h_a$ = 21%, while DEMOTE achieved $h_a$ = 46%, which
was a 7.5x speedup over NONE-LRU, as seen in table 3.


Figure 4 compares the cumulative latencies achieved
with NONE-LRU and DEMOTE. For DEMOTE, the jump
at 0.4 ms corresponds to the cost of an array hit plus the
cost of a demotion. In contrast, NONE-LRU got fewer array
hits (table 2), and its curve has a significantly smaller
jump at 0.2 ms, which is the cost of an array cache hit
without a demotion.


3.3 The SEQ synthetic workload


Sequential accesses are common in scientific, decision-support
and data-mining workloads. To evaluate the
benefit of exclusive caching for such workloads, we
simulated READs of sequential blocks from a working
set of $N_{seq}$ contiguous blocks, chosen so that the
working set would fully occupy the combined client
and array caches: $N_c = N_a = 16384$, and $N_{seq} = N_c +
N_a - 1 = 32767$ blocks (the $-1$ accounts for double-caching
of the most recently read block). We issued $N_{seq}$
warm-up READs, followed by $10 \times N_{seq}$ timed one-block
READs.


We expected that at the end of the warm-up period, 
the client would contain the blocks in the second half 
of the sequence, and an array under exclusive caching 
would contain the blocks in the first half. Thus, with 
DEMOTE, all subsequent READs should hit in the array. 
On the other hand, with NONE-LRU and DEMOTE-LRU, 
we expected that the array would always contain the 
same blocks as the client; neither the client nor the ar- 
ray would have the next block in the sequence, and all 
READs would miss. 


Again, the results in table 2 validate our expectations.
Although no READs ever hit in the client, they all hit in
the array with DEMOTE. The mean latency for DEMOTE-LRU
was higher than for NONE-LRU because it pointlessly
demoted blocks that the array discarded before
they were reused. Although all READs missed in both
caches with NONE-LRU and DEMOTE-LRU, the mean latencies
of 1.67 ms and 1.91 ms respectively were less
than the random-access disk latency $T_d$ thanks to read-ahead
in the disk drive [33].


The cumulative latency graph in figure 4 further demonstrates
the benefit of DEMOTE over NONE-LRU: all
READs with DEMOTE had a latency of 0.4 ms (the cost
of an array hit plus a demotion), while all READs with
NONE-LRU had latencies between 1.03 ms (the cost of a
disk access with read-ahead caching) and 10 ms (the disk
latency $T_d$ incurred when the READ sequence wraps
around). Overall, DEMOTE achieved a 3.5x speedup
over NONE-LRU, as seen in table 3.


Figure 4: Cumulative READ fraction vs. mean READ latency for the
RANDOM, SEQ, and ZIPF workloads with NONE-LRU and DEMOTE.


3.4 The ZIPF synthetic workload


Our Zipf workload sent READs from a set of $N_{zipf}$
blocks, with $N_{zipf} = 1.5(N_c + N_a)$, so for $N_c = N_a
= 16384$, $N_{zipf} = 49152$. This resulted in three equal
size sets of $N_{zipf}/3$ blocks: $Z_0$ for the most active third
(which received 90% of the accesses), $Z_1$ for the next
most active (6% of the accesses), and $Z_2$ for the least
active (the remaining 4% of the accesses). We issued $N_{zipf}$
warm-up READs, followed by $10 \times N_{zipf}$ timed READs.
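

The 90% / 6% / 4% split across the three sets follows from the Zipf weights themselves. The sketch below assumes $\alpha = 1$ (the value the text calls common; the exact constant used for this workload is not stated) and computes the share of accesses falling in each third of a 49152-block set. The function name `zipf_set_shares` is hypothetical.

```python
def zipf_set_shares(n_blocks, n_sets=3, alpha=1.0):
    """Share of accesses landing in each equal-size set of blocks.

    Block i (1-based, most popular first) is read with probability
    proportional to 1 / i**alpha.
    """
    weights = [1.0 / (i ** alpha) for i in range(1, n_blocks + 1)]
    total = sum(weights)
    size = n_blocks // n_sets
    return [sum(weights[k * size:(k + 1) * size]) / total
            for k in range(n_sets)]


shares = zipf_set_shares(49152)
print([f"{s:.1%}" for s in shares])   # shares for the three thirds, most active first
```

Rounded to whole percentages, these shares agree with the figures quoted for the three sets.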


We expected that at the end of the warm-up set, the client
cache would be mostly filled with blocks from $Z_0$ with
the highest request probabilities, and that an array under
exclusive caching would be mostly filled with the blocks
from $Z_1$ with the next highest probabilities. With our
test workload, exclusive caching schemes should thus
achieve $h_c$ = 90% and $h_a$ = 6% in steady state. On the
other hand, the more inclusive caching schemes (NONE-LRU
and DEMOTE-LRU) would simply populate the array
cache with the most-recently read blocks, which
would be mostly from $Z_0$, and thus achieve a lower array
hit rate $h_a$.


The results in table 2 validate our expectations. The
client always achieved $h_c$ = 86% (slightly lower than
the anticipated 90% due to an incomplete warm-up). But
there was a big difference in $h_a$: DEMOTE achieved 9%,
while NONE-LRU achieved only 2%.


The cumulative latency graph in figure 4 supports this: 
as with RANDOM, the curve for DEMOTE has a much 
larger jump at 0.4 ms (the cost of an array hit plus a 
demotion) than NONE-LRU does at 0.2 ms (the cost of 
an array hit alone). Overall, DEMOTE achieved a 1.7x 
speedup over NONE-LRU, as seen in table 3. This may 
seem surprising given the modest increase in array hit 
rate, but is more readily understandable when viewed as 
a decrease in the overall miss rate from 12% to 5%. 


3.5 SAN bandwidth sensitivity analysis 


Exclusive caching using demotions relies on a low-latency,
high-bandwidth SAN to allow the array cache to
perform as a low-latency extension of the client cache.
The more this expectation is violated (i.e., as SAN latency
increases), the less benefit we expect to see,
possibly to the point where demotions are not worth
doing. To explore this effect, we conducted a sensitivity
analysis, using Pantheon to explore the effects of
varying the simulated SAN bandwidth from 10 Gbit/s to
10 Mbit/s on the NONE-LRU and DEMOTE schemes.


Figure 5: Mean READ latency vs. SAN bandwidth for the
RANDOM and ZIPF workloads.


Our experiments validated our expectations. Figure 5
shows that at very low effective SAN bandwidths (less
than 20-30 Mbit/s), NONE-LRU outperformed DEMOTE,
but DEMOTE won as soon as the bandwidth rose above
this threshold. The results for RANDOM and ZIPF are
similar, except that the gap between the NONE-LRU and
DEMOTE curves for high-bandwidth networks is smaller
for ZIPF since the increase in array hit rate (and the resultant
speedup) was smaller.


3.6 Evaluation environment: fscachesim 


For subsequent experiments, we required a simulator capable
of modeling multi-gigabyte caches, which was beyond
the abilities of Pantheon. To this end, we developed
a simulator called fscachesim that only tracks
the client and array cache contents, omitting detailed
disk and SAN latency measurements. fscachesim is
simpler than Pantheon, but its predictions for our
study are similar: we repeated the experiments described
in sections 3.2 and 3.4 with identical workloads, and
confirmed that the client and array hit rates matched exactly.
We used fscachesim for all the experimental
work described in the remainder of this paper.


3.7 Cache size sensitivity analysis 


In the results reported so far, we have assumed that the
client cache is the same size as the array cache. This section
reports on what happens if we relax this assumption,
using a 64 MB client cache and RANDOM and ZIPF.


We expected that an array with the NONE-LRU inclusive
scheme would provide no reduction in mean latency until
its cache size exceeds that of the client, while one with
the DEMOTE exclusive scheme would provide reductions
in mean latency for any cache size until the working set
fits in the aggregate of the client and array caches.


The results in figure 6 confirm our expectations. Max- 
imum benefit occurs when the two caches are of equal 
size, but DEMOTE provides benefits over roughly a 10:1 
ratio of cache sizes on either side of the equal-size case. 


3.8 Summary 


The synthetic workload results show that DEMOTE offers
significant potential benefits: 1.7-7.5x speedups
are hard to ignore. Better yet, these benefits are mostly
insensitive to variations in SAN bandwidth and only
moderately sensitive to the client:array cache size ratio.


Since our results showed that DEMOTE-LRU never outperformed
DEMOTE, we did not consider it further. We
also investigated schemes with different combinations of
LRU and most-recently-used (MRU) replacement policies
at the client and array in conjunction with demotions,
and found that none performed as well as DEMOTE.


Figure 6: Mean READ latency vs. array cache size for the
RANDOM and ZIPF workloads. The client cache size was fixed
at 64 MB. The 64 MB size is marked with a dotted line.


Table 4: Real-life workload data, with date, storage capacity,
array cache size, client count, trace duration, and I/O count.
'Warm-up' is the fraction of the trace used to pre-load the
caches in our experiments. For DB2 and HTTPD, working set
size instead of capacity is shown. '—' are unknown entries.


4 Single-client real-life workloads 


Having demonstrated the benefits of demotion-based exclusive
caching for synthetic workloads, we now evaluate
its benefits for real-life workloads, in the form of
traces taken from the running systems shown in table 4.


Some of the traces available to us are somewhat old, and
cache sizes considered impressive then are small today.
Given this, we set the cache sizes in our experiments
commensurate with the time-frame and scale of the system
from which the traces were taken.


We used fscachesim to simulate a system model similar
to the one in figure 3, with cache sizes scaled to
reflect the data in table 4. We used equations 2 and 3
with $T_a$ = 0.2 ms and $T_d$ = 5 ms to convert cache hit rates
into mean latency predictions. This disk latency is more
aggressive than that obtained from Pantheon, to reflect
the improvements in disk performance seen in the more
recent systems. We further assumed that there was sufficient
SAN bandwidth to avoid contention, and set the
cost of an aborted demotion to 0.16 ms (the cost of SAN
controller overheads without an actual data transfer).


         |        | NONE-LRU           | DEMOTE
Workload | Client | Array  | Latency   | Array  | Latency
CELLO99  | 54%    | 1%     | 2.34 ms   | 13%    | 1.83 ms (1.28x)
DB2      | ?      | ?      | 5.01 ms   | 33%    | 3.57 ms (1.40x)
HTTPD    | ?      | ?      | ?         | 10%    | 0.24 ms (2.20x)

Table 5: Client and array hit rates and mean latencies for
single-client real-life workloads. Client hit rates are the same
for all schemes. Latencies are computed using equations 2 and
3 with $T_a$ = 0.2 ms and $T_d$ = 5 ms. Speedups for DEMOTE over
NONE-LRU are also shown.


As before, our chief metric of evaluation is the im- 
provement in the mean latency of a READ achieved by 
demotion-based exclusive caching schemes. 


4.1 The CELLO99 real-life workload


The CELLO99 workload comprises a trace of every disk
I/O access for the month of April 1999 from an HP 9000 
K570 server with 4 CPUs, about 2 GB of main memory, 
two HP AutoRAID arrays and 18 directly connected disk 
drives. The system ran a general time-sharing load un- 
der HP-UX 10.20; it is the successor to the CELLO sys- 
tem Ruemmler and Wilkes describe in their analysis of 
UNIX disk access patterns [32]. In our experiments, we 
simulated 2 GB client and array caches. 


Figure 7 suggests that switching from inclusive to
exclusive caching, with the consequent doubling of effective
cache size from 2 GB to 4 GB, should yield a
noticeable increase in array hit rate. The results shown
in table 5 demonstrate this: using DEMOTE achieved
$h_a$ = 13% (compared to $h_a$ = 1% with NONE-LRU),
yielding a 1.28x speedup, solely from changing the
way the array cache is managed.
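

The CELLO99 latencies can be reproduced from the hit rates with equations 2 and 3, using $T_a$ = 0.2 ms, $T_d$ = 5 ms and the client hit rate $h_c$ = 54% from table 5. This worked check treats every demotion as a full transfer and ignores the cheaper 0.16 ms aborted-demotion case, so it is an approximation:

```latex
\begin{align*}
\text{NONE-LRU (eq. 2): } & \mathit{miss} = 1 - (0.54 + 0.01) = 0.45 \\
  & T_{mean} \approx 0.2(0.01) + (0.2 + 5)(0.45) = 0.002 + 2.340 = 2.342\ \text{ms} \\
\text{DEMOTE (eq. 3): } & \mathit{miss} = 1 - (0.54 + 0.13) = 0.33 \\
  & T_{mean} \approx 2(0.2)(0.13) + (2(0.2) + 5)(0.33) = 0.052 + 1.782 = 1.834\ \text{ms} \\
\text{Speedup: } & 2.342 / 1.834 \approx 1.28
\end{align*}
```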


4.2 The DB2 real-life workload 


The DB2 trace-based workload was generated by an
eight-node IBM SP2 system running an IBM DB2
database application that performed join, set and aggregation
operations on a 5.2 GB data set. Uysal et al. used
this trace in their study of I/O on parallel machines [40].


The eight client nodes accessed disjoint sections of the 
database; for the single-client workload experiment we 
combined all these access streams into one. 


DB2 exhibits a behavior between the sequential and random
workload styles seen in the SEQ and RANDOM synthetic
workloads. The graph for DB2 in figure 7 suggests
that a single 4 GB cache would achieve about a 37% hit
rate, but that a split cache with 2 GB at each of the client
and array would achieve almost no hits at all with inclusive
caching; thus, DEMOTE should do much better
than NONE-LRU. The results shown in table 5 bear this
out: DEMOTE achieved a 33% array hit rate, and a 1.40x
speedup over NONE-LRU.


Figure 7: Cumulative hit rate vs. cache size graphs for single-client
real-life workloads.


           |        | NONE-LRU                 | DEMOTE
Array size | Client | Array | Latency         | Array | Latency
2 GB       | 23%    | 0%    | 4.01 ms         | ?     | 4.13 ms (0.97x)
16 GB      | 23%    | 0%    | 4.01 ms (1.00x) | ?     | 3.86 ms (1.04x)
32 GB      | 23%    | 1%    | 3.97 ms (1.01x) | 13%   | 3.54 ms (1.13x)

Table 6: Client and array hit rates and mean latencies for
single-client TPC-H for different array caches. Client hit rates
and cache sizes (32 GB) are the same for all schemes. Latencies
are computed using equations 2 and 3 with $T_a$ = 0.2 ms
and $T_d$ = 5 ms. Speedups are with respect to a 2 GB array cache
with NONE-LRU.


4.3 The HTTPD real-life workload


The HTTPD workload was generated by a seven-node 
IBM SP2 parallel web server [22] serving a 524 MB data 
set. Uysal et al. also used this trace in their study [40]. 
Again, we combined the client streams into one. 


HTTPD has similar characteristics to ZIPF. A single
256 MB cache would hold the entire active working set;
we elected to perform the experiment with 128 MB of
cache split equally between the client and the array in
order to obtain more interesting results. An aggregate
cache of this size should achieve $h_c + h_a \approx 95\%$ according
to the graph in figure 7, with the client achieving
$h_c \approx 85\%$, and an array under exclusive caching the remaining
$h_a \approx 10\%$.


Table 5 shows that the expected benefit indeed occurs: 
DEMOTE achieved a 10% array hit rate, and an impres- 
sive 2.2x speedup over NONE-LRU. 


4.4 The TPC-H real-life workload 


The TPC-H workload is a 1-hour portion of a 39-hour
trace of a system that performed an audited run [18]
of the TPC-H database benchmark [37]. This system
illustrates high-end commercial decision-support
systems: it comprised an 8-CPU (550 MHz PA-RISC)
HP 9000 N4000 server with 32 GB of main memory and
2.1 TB of storage capacity, on 124 disks spread across
3 arrays (with 1.6 GB of aggregate cache) and 4 non-redundant
disk trays. The host computer was already
at its maximum-memory configuration in these tests, so
adding additional host memory was not an option. Given
that this was a decision-support system, we expected to
find a great deal of sequential traffic, and relatively little
cache reuse. Our expectations are borne out by the
results.


In our TPC-H experiments, we used a 16 KB block size, 
a 32 GB client cache, and a 2 GB array cache as the base- 
line, and explored the effects of changing the array cache 
size up to 32 GB. Table 6 shows the results. 


The traditional, inclusive caching scheme showed no improvement
in latency until the array cache size reached
32 GB, at which point we saw a tiny (1%) improvement.


With a 2 GB array cache, DEMOTE yielded a slight slowdown
(0.97x speedup), because it paid the cost of doing
demotions without increasing the array cache hit
rate significantly. However, DEMOTE obtained a 1.04x
speedup at 16 GB, and a 1.13x speedup at 32 GB, while
the inclusive caching scheme showed no benefits. This 
data confirms that cache reuse was not a major factor in 
this workload, but indicates that the exclusive caching 
scheme took advantage of what reuse there was. 


4.5 Summary 


The results from real-life workloads support our earlier 
conclusions: apart from the TPC-H baseline, which ex- 
perienced a small 0.97x slowdown due to the cost of 
non-beneficial demotions, we achieved up to a 2.20x 
speedup. 


We find these results quite gratifying, given that extensive
previous research on cache systems enthusiastically
reports performance improvements of a few percent
(e.g., a ~1.12x speedup).


5 Multi-client systems


Multi-client systems introduce a new complication: the 
sharing of data between clients. Note that we are delib- 
erately not trying to achieve client-memory sharing, in 
the style of protocols such as GMS [13, 42]. One benefit 
is that our scheme does not need to maintain a directory 
of which clients are caching which blocks. 


Having multiple clients cache the same block does not 
itself raise problems (we assume that the clients wish 
to access the data, or they would not have read it), but 
exploiting the array cache as a shared resource does: it 
may no longer be a good idea to discard a recently read 
block from the array cache as soon as it has been sent 
to a client. To help reason about this, we consider two 
boundary cases here. Of course, real workloads show 
behavior between these extremes. 


Disjoint workloads: The clients each issue READs for 
non-overlapped parts of the aggregate working set. The 
READs appear to the array as if one client had issued 
them, from a cache as large as the aggregate of the client 
caches. To determine if exclusive caching will help, we 
use the cumulative hit rate vs. cache size graph to esti- 
mate the array hit rate as if a single client had issued all 
READs, as in section 2. 


Conjoint workloads: The clients issue exactly the same 
READ requests in the same order at the exact same time. 
If we arbitrarily designate the first client to issue an I/O
as the leader, and the others as followers, we see that
READs that hit in the leader also will hit in the followers.
The READs appear to the array as if one client had issued
them from a cache as large as an individual client cache. 


To determine if the leader will benefit from exclusive 
caching, we use the cumulative hit rate vs. cache size 
graph to estimate the array hit rate as if the leader had 
issued all READs, as in section 2.
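The two boundary cases reduce to a simple rule for the size of the single client cache that the array effectively sees. The sketch below is illustrative only: the function name `effective_client_cache` and the string labels are assumptions, and the example numbers (eight clients with 256 MB each) are borrowed from the DB2 configuration described later.

```python
def effective_client_cache(client_cache_mb, n_clients, workload):
    """Size of the one client cache that the READ stream appears to come from."""
    if workload == "disjoint":
        # Non-overlapping working sets: the client caches add up.
        return client_cache_mb * n_clients
    if workload == "conjoint":
        # Identical request streams: every client caches the same blocks.
        return client_cache_mb
    raise ValueError(f"unknown workload type: {workload!r}")


disjoint_mb = effective_client_cache(256, 8, "disjoint")
conjoint_mb = effective_client_cache(256, 8, "conjoint")
print(disjoint_mb, conjoint_mb)
```

The resulting size is what one would look up on the cumulative hit rate vs. cache size graph when estimating the array hit rate.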


To determine if the followers will benefit from exclu- 
sive caching, we observe that all READs that miss for the 
leader in the array will also cause the followers to stall, 
waiting for that block to be read into the array cache. As 
soon as it arrives there, it will be sent to the leader, and 
then all the followers, before it is discarded. That is, the 
followers will see the same performance as the leader. 


In systems that employ demotion, the followers waste 
time demoting blocks that the leader has already de- 
moted. Fortunately, these demotions will be relatively 
cheap because they need not transfer any data. 


5.1 Adaptive cache insertion policies 


Our initial results from applying the simple demotion-based
exclusive caching scheme described above to multi-client
systems were mixed. At first, we evaluated NONE-LRU
and DEMOTE in a multi-client system similar to the one
shown in figure 3, with the single client shown in that
figure simply replaced by N clients, each with 1/N of the
cache memory of the single client. As expected, workloads
in which clients shared few or no blocks (disjoint
workloads) benefitted from DEMOTE. 


Unfortunately, workloads in which clients shared blocks 
performed worse with DEMOTE than with NONE-LRU, 
because shared workloads are not conjoint in practice: 
clients do not typically READ the same blocks in the 
same order at the same time. Instead, a READ for block 
X by one client may be followed by several READs for 
other blocks before a second READ for X by another 
client. Recall that with DEMOTE the array puts blocks 
read from disk at the head of the LRU queue, i.e., in
MRU order. Thus, the array is likely to eject X before 
the READ from the later client. 


We made an early design decision to avoid the complex- 
ities of schemes that require the array to track which 
clients had which blocks and request copies back from 
them—we wanted to keep the client-to-array interaction 
as simple, and as close to standard SCSI, as possible. 


Figure 8: Operation of read and demoted ghost caches in conjunction
with the array cache. The array inserts the metadata of
incoming read (demoted) blocks into the corresponding ghost,
and the data into the cache. The cache is divided into segments
of either uniform or exponentially-growing size. The array selects
the segment into which to insert the incoming read (demoted)
block based on the hit count in the corresponding ghost.


Our first insight was that the array should reserve a portion
of its cache to keep blocks recently read from disk
“for a while”, in case another client requests them. To
achieve this, we experimented with a segmented LRU
(SLRU) array cache [21], one with probationary and
protected segments, each managed in LRU fashion. The
array puts newly inserted blocks (read and demoted) at
the tail of the probationary segment, and moves them to
the tail of the protected segment if a subsequent READ
hits them. The array moves blocks from the head of the
protected segment to the tail of the probationary one, and
ejects blocks from the head of the probationary segment.
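The segmented LRU behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the simulator used in the paper: the class name `SegmentedLRU`, its method names, the choice to ignore an insert of an already-cached block, and the choice to move a re-hit protected block to the protected tail are all assumptions that the text does not specify.

```python
from collections import OrderedDict


class SegmentedLRU:
    """Two LRU segments; the first key of each OrderedDict is its head."""

    def __init__(self, probationary_size, protected_size):
        self.prob_size = probationary_size
        self.prot_size = protected_size
        self.probationary = OrderedDict()
        self.protected = OrderedDict()

    def insert(self, block):
        """A newly read or demoted block enters at the probationary tail."""
        if block in self.probationary or block in self.protected:
            return  # assumption: already-cached blocks are left in place
        self.probationary[block] = True
        self._trim_probationary()

    def read(self, block):
        """Return True on a hit; a hit moves the block to the protected tail."""
        if block in self.probationary:
            del self.probationary[block]
        elif block in self.protected:
            del self.protected[block]
        else:
            return False
        self.protected[block] = True
        while len(self.protected) > self.prot_size:
            # Overflow: the protected head moves to the probationary tail.
            victim, _ = self.protected.popitem(last=False)
            self.probationary[victim] = True
        self._trim_probationary()
        return True

    def _trim_probationary(self):
        while len(self.probationary) > self.prob_size:
            self.probationary.popitem(last=False)  # eject from the head


cache = SegmentedLRU(probationary_size=2, protected_size=1)
for block in ["a", "b"]:
    cache.insert(block)
cache.read("a")    # hit: "a" is promoted to the protected segment
cache.insert("c")
cache.insert("d")  # probationary overflows, so "b" is ejected
cache.read("c")    # hit: "c" displaces "a" back to the probationary tail
print(list(cache.probationary), list(cache.protected))
```

The sizes of the two segments are the tuning knob whose best value varied so widely between workloads.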


SLRU improved performance somewhat, but the opti- 
mal size of the protected segment varied greatly with the 
workload: the best size was either very small (less than 
8% of the total), or quite large (over 50%). These results 
were less robust than we desired. 


Our second insight is that the array can treat the LRU 
queue as a continuum, rather than as a pair of segments: 
inserting a block near the head causes that block to have 
a shorter expected lifetime in the queue than inserting it 
near the tail. We can then use different insertion points 
for demoted blocks and disk-read blocks. (Pure DE- 
MOTE is an extreme instance that only uses the ends of 
the LRU queue, and SLRU is an instance where the insertion
point is a fixed distance down the LRU queue.)


Our experience with SLRU suggested that the array 
should select the insertion points adaptively in response 
to workload characteristics instead of selecting them 
statically. For example, the array should insert demoted 
blocks closer to the tail of its LRU queue than disk-read 
blocks if subsequent READs hit demoted blocks more of- 
ten. To support this, we implemented ghost caches at the 
array for demoted and disk-read blocks. 


A ghost cache behaves like a real cache except that 
it only keeps cache metadata, enabling it to simulate
the behavior of a real cache using much less memory.
We used a pair of ghost caches to simulate the performance
of hypothetical array caches that only inserted
blocks from a particular source: either demotions or
disk reads. Just like the real cache, each ghost cache 
was updated on READs to track hits and execute its LRU 
policy. 
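A ghost cache of this kind can be sketched as an LRU structure that stores block identifiers only. The code below is a hedged illustration: the class name `GhostCache`, its methods, and the tiny capacities are assumptions chosen for the example rather than details from the paper.

```python
from collections import OrderedDict


class GhostCache:
    """LRU cache of block identifiers only (no data), counting READ hits."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.ids = OrderedDict()  # the first key is the next to be ejected
        self.hits = 0

    def insert(self, block_id):
        """Record metadata for a block arriving from this ghost's source."""
        self.ids.pop(block_id, None)
        self.ids[block_id] = None
        if len(self.ids) > self.capacity:
            self.ids.popitem(last=False)

    def read(self, block_id):
        """Update the hit count and LRU order for a READ; True on a hit."""
        if block_id not in self.ids:
            return False
        self.hits += 1
        self.ids.move_to_end(block_id)
        return True


# One ghost per insertion source, both observing the same READ stream.
demoted_ghost = GhostCache(capacity=2)
disk_read_ghost = GhostCache(capacity=2)
demoted_ghost.insert("A")     # block A arrived via a demotion
disk_read_ghost.insert("B")   # block B arrived via a disk read
for block in ["A", "A", "B", "C"]:
    demoted_ghost.read(block)
    disk_read_ghost.read(block)
print(demoted_ghost.hits, disk_read_ghost.hits)  # 2 1
```

Because only identifiers are stored, each ghost costs a small fraction of the memory of the cache it simulates.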


We used the ghost caches to provide information about
which insertion sources are more likely to insert
blocks that are productive to cache, and hence where in
the real cache future insertions from this source should
go (as shown in figure 8). This was done by calculating
the insertion point in the real cache from the relative
hit counts of the ghost caches. To do so, we assigned
the value 0 to represent the head of the real array LRU
queue, and the value 1 to the tail; the insertion points for
demoted and disk-read blocks were given by the ratio of
the hit rates seen by their respective ghost caches to the
total hit rate across all ghost caches.


To make insertion at an arbitrary point more computationally
tractable, we approximated this by dividing the
real array LRU queue into a fixed number of segments
N_segs (10 in our experiments), multiplying the calculated
insertion point by N_segs, and inserting the block at the tail
of that segment.
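The insertion-point calculation can be sketched as below. The function names, the fallback used when no ghost hits have been seen yet, and the clamping of an insertion point of exactly 1 to the last segment are assumptions added to make the example runnable; the text does not describe how these corner cases were handled. Hit counts are used in place of hit rates, which gives the same ratio when both ghosts observe the same READ stream.

```python
def insertion_point(source_hits, all_hits):
    """Fraction along the LRU queue: 0.0 is the head, 1.0 is the tail."""
    total = sum(all_hits)
    if total == 0:
        return 0.5  # assumption: with no evidence yet, use the middle
    return source_hits / total


def segment_index(point, n_segs=10):
    """Map an insertion point to a segment; blocks go at that segment's tail."""
    return min(int(point * n_segs), n_segs - 1)  # clamp point == 1.0 (assumption)


# Illustrative hit counts: demoted blocks are re-read three times as often.
demoted_hits, disk_read_hits = 30, 10
hits = [demoted_hits, disk_read_hits]
demoted_seg = segment_index(insertion_point(demoted_hits, hits))
disk_read_seg = segment_index(insertion_point(disk_read_hits, hits))
print(demoted_seg, disk_read_seg)  # 7 2
```

With these counts, demoted blocks enter nearer the tail and so stay in the array cache longer than disk-read blocks.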


We experimented with uniform segments, and with ex- 
ponential segments (each segment was twice the size of 
the preceding one, the smallest being at the head of the 
array LRU queue). The same segment-index calculation 
was used for both schemes, causing the scheme with seg- 
ments of exponential size to give significantly shorter 
lifetimes to blocks predicted to be less popular. 
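To see why exponential segments shorten the lifetime of blocks predicted to be less popular, compare how far from the head a block lands for the same segment index under both layouts. This is a sketch with assumed names (`segment_sizes`, `position_from_head`) and an assumed cache size of 1023 blocks, chosen so that the doubling sizes come out as whole numbers.

```python
def segment_sizes(total_blocks, n_segs, exponential):
    """Segment sizes from head to tail, summing to total_blocks."""
    weights = [2 ** i if exponential else 1 for i in range(n_segs)]
    scale = total_blocks / sum(weights)
    sizes = [int(w * scale) for w in weights]
    sizes[-1] += total_blocks - sum(sizes)  # rounding remainder goes to the tail
    return sizes


def position_from_head(sizes, seg):
    """Queue position (counted from the head) of the tail of segment seg."""
    return sum(sizes[: seg + 1])


uniform = segment_sizes(1023, 10, exponential=False)
exponential = segment_sizes(1023, 10, exponential=True)
for seg in (2, 7):
    print(seg, position_from_head(uniform, seg), position_from_head(exponential, seg))
```

In this example a block placed in segment 2 sits 306 positions from the head with uniform segments but only 7 with exponential ones, so it is ejected far sooner.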


We designated the combination of demotions with ghost
caches and uniform segments at the array as DEMOTE-ADAPT-UNI,
and that of demotions with ghost caches
and exponential segments as DEMOTE-ADAPT-EXP. We
then re-ran the experiments for which we had data for 
multiple clients, but separated out the individual clients. 


5.2 The multi-client DB2 workload


We used the same DB2 workload described in sec- 
tion 4.2, but with the eight clients kept separate. Each 
client had a 256 MB cache, so the aggregate of client 
caches remained at 2 GB. The array had 2 GB of cache. 


Each DB2 client accesses disjoint parts of the database.
Given our qualitative analysis of disjoint workloads, and
the speedup for DB2 in a single-client system with DEMOTE,
we expected to obtain speedups in this multi-client
system. If we assume that each client uses one
eighth (256 MB) of the array cache, then each client has
an aggregate of 512 MB to hold its part of the database,
and we expected from figure 9 that exclusive caching
would obtain a significant increase in array hit rates, with
a corresponding reduction in mean latency.


Figure 9: Cumulative hit rate vs. cache size for DB2 clients.


Table 7: Per-client mean latencies (in ms) for multi-client
DB2. Latencies are computed using equations 2 and 3 with
T_a = 0.2 ms and T_d = 5 ms. Speedups over NONE-LRU, and
the geometric mean of all client speedups, are also shown.


Our results shown in table 7 agree: DEMOTE achieved an 
impressive 1.50x speedup over NONE-LRU. DEMOTE-ADAPT-UNI
and DEMOTE-ADAPT-EXP achieved only
1.27-1.32x speedups, since they were more likely to
keep disk-read blocks in the cache, reducing the cache 
available for demoted blocks, and thus making the cache 
less effective for this workload. 


5.3 The multi-client HTTPD workload


We returned to the original HTTPD workload, and separated
the original clients. We gave 8 MB to each client
cache, and kept the 64 MB array cache as before. 


Figure 10 indicates that the per-client workloads are
somewhat similar to the ZIPF synthetic workload. As
shown in section 3.4, disk-read blocks for such workloads
will in general have low probabilities of being
reused, while demoted blocks will have higher probabilities.
On the other hand, as shown by the histogram
in table 8, clients share a high proportion of blocks, and
tend to exhibit conjoint workload behavior. Thus, while
the array should discard disk-read blocks more quickly
than demoted blocks, it should not discard them immediately.


Figure 10: Cumulative hit rate vs. cache size for HTTPD
clients.


No. clients |     1 |    2 |    3 |    4 |    5 |     6 |    7
No. blocks  | 13173 | 8282 | 5371 | 5570 | 6934 | 24251 | 5280
% of total  |   19% |  12% |   8% |   8% |  10% |   35% |   8%

Table 8: Histogram showing the number of blocks shared by
x HTTPD clients, where x ranges from 1 to 7 clients.


Table 9: Per-client mean latencies (in ms) for multi-client
HTTPD. Latencies are computed using equations 2 and 3 with
T_a = 0.2 ms and T_d = 5 ms. Speedups over NONE-LRU, and
the geometric mean of all client speedups, are also shown.


Given this analysis, we expected DEMOTE to post less 
impressive results than adaptive schemes, and indeed it 
did, as shown in table 9: a 0.55x slowdown in mean
latency over NONE-LRU. On the other hand, DEMOTE-ADAPT-EXP
achieved a 1.18x speedup. DEMOTE-ADAPT-UNI
achieved a 0.91x slowdown, which we
attribute to demoted blocks being much more valuable 
than disk-read ones, but the cache with uniform seg- 
ments devoting too little of its space to them compared 
to the one with exponential segments. 


5.4 The OPENMAIL workload


The OPENMAIL workload comes from a trace of a production
e-mail system running the HP OpenMail application
for 25,700 users, 9,800 of whom were active during
the hour-long trace. The system consisted of six
HP 9000 K580 servers running HP-UX 10.20, each with
6 CPUs, 2 GB of memory, and 7 SCSI interface cards. The
servers were attached to four EMC Symmetrix 3700 disk
arrays. At the time of the trace, the servers were experiencing
some load imbalances, and one was I/O bound.


General Track: 2002 USENIX Annual Technical Conference


[Plot: cumulative hit rate vs. cache size (MB) for OPENMAIL clients 1 to 6.]

Figure 11: Cumulative hit rate vs. cache size for OPENMAIL.


Table 10: Per-client mean latencies (in ms) for OPENMAIL.
Latencies are computed using equations 2 and 3 with
T_a = 0.2 ms and T_d = 5 ms. Speedups over NONE-LRU, and
the geometric mean of all client speedups, are also shown.


Figure 11 suggests that 2 GB client caches would hold 
the entire working set for all but two clients. To obtain 
more interesting results, we simulated six clients with 
1 GB caches connected to an array with a 6 GB cache. 


OPENMAIL is a disjoint workload, and thus should obtain
speedups from exclusive caching. If we assume that
each client uses a sixth (1 GB) of the array cache, then
each client has an aggregate of 2 GB to hold its workload,
and we see from figure 11 that an array under exclusive
caching should obtain a significant increase
in array cache hit rate, and a corresponding reduction in 
mean latency. 


As with DB2, our results (table 10) bear out our expectations:
DEMOTE, which aggressively discards read
blocks and holds demoted blocks in the array, obtained
a 1.15x speedup over NONE-LRU. DEMOTE-ADAPT-UNI
and DEMOTE-ADAPT-EXP fared less well, yielding
a 1.07x speedup and 0.88x slowdown respectively.


5.5 Summary


The clear benefits from single-client workloads are not 
so easily repeated in the multi-client case. For largely 
disjoint workloads, such as DB2 and OPENMAIL, the 
simple DEMOTE scheme does well, but it falls down 
when there is a large amount of data sharing. On 
the other hand, the adaptive demotion schemes do well 
when simple DEMOTE fails, which suggests that a mechanism
to switch between the two may be helpful.


Overall, our results suggest that even when demotion-based
schemes seem not to be ideal, it is usually possible
to find a setting where performance is improved. In the 
enterprise environments we target, such tuning is an ex- 
pected part of bringing a system into production. 


6 Related work 


The literature on caching in storage systems is large and 
rich, so we only cite a few representative samples. Much 
of it focuses on predicting the performance of an exist- 
ing cache hierarchy [6, 24, 35, 34], describing existing 
I/O systems [17, 25, 39], and determining when to flush
write-back data to disk [21, 26, 41]. Real workloads con- 
tinue to demonstrate that read caching has considerable 
value in arrays, and that a small amount of non-volatile 
memory greatly improves write performance [32, 39]. 


We are not the first to have observed the drawbacks 
of inclusive caching. Muntz et al. [27, 28] show
that intermediate-layer caches for file servers perform 
poorly, and much of the work on cache replacement al- 
gorithms is motivated by this observation [21, 24, 30, 
48]. Our DEMOTE scheme, with alternative array cache 
replacement policies, is another such remedy. 


Choosing the correct cache replacement policy in an ar- 
ray can improve its performance [19, 21, 30, 35, 48]. 
Some studies suggest using least-frequently-used [15, 
46] or frequency-based [30] replacement policies instead 
of LRU in file servers. MRU [23] or next-block pre- 
diction [29] policies have been shown to provide better 
performance for sequential loads. LRU or clocking policies
[10] can yield acceptable results for database loads;
for example, the IBM DB2 database system [36] imple- 
ments an augmented LRU-style policy. 


Our DEMOTE operation can be viewed as a very sim- 
ple form of a client-controlled caching policy [7], which 
could be implemented using the “write to cache” opera- 
tion available on some arrays (e.g., those from IBM [3]). 
The difference is that we provide no way for the client 
to control which blocks the array should replace, and we 
trust the client to be well-behaved. 


Recent studies of cooperative World Wide Web caching
protocols [11, 20, 47] look at policies beyond LRU
and MRU. Previously, analyses of web request traces 
[2, 5, 8] showed the file popularity distributions to be 
Zipf-like [49]. It is possible that schemes tuned for these 
workloads will perform as well for the sequential or ran- 
dom access patterns found in file system workloads, but 
a comprehensive evaluation of them is outside the scope 
of this paper. In addition, web caching, with its potentially
millions of clients, is targeted at a very different
environment than our work. 


Peer-to-peer cooperative caching studies are relevant to 
our multi-client case. In the “direct client coopera- 
tion” model [9], active clients offload excess blocks onto 
idle peers. No inter-client sharing occurs; cooperation
is simply a way to exploit otherwise unused memory. 
The GMS global memory management project consid- 
ers finding the nodes with idle memory [13, 42]. Coop- 
erating nodes use approximate knowledge of the global 
memory state to make caching and ejection decisions 
that benefit a page-faulting client and the whole cluster. 


Perhaps the closest work to ours in spirit is a global 
memory management protocol developed for database 
management systems [14]. Here, the database server 
keeps a directory of pages in the aggregate cache. This 
directory allows the server to forward a page request 
from one client to another that has the data, request that 
a client demote rather than discard the last in-memory 
copy of a page, and preferentially discard pages that 
have already been sent to a client. We take a simpler 
approach: we do not track which client has what block, 
and thus cannot support inter-client transfers—but we 
need neither a directory nor major changes to the SCSI 
protocol. We rely on a high-speed network to perform 
DEMOTE eagerly (rather than first check to see if it is
worthwhile) and we do not require a (potentially large) 
data structure at the array to keep track of what blocks 
are where. Lower complexity has a price: we are less 
able to exploit block sharing between clients. 


7 Conclusion 


We began our study with a simple idea: that a DEMOTE 
operation might make array caches more exclusive and 
thus achieve better hit rates. Experiments with simple 
synthetic workloads support this hypothesis; moreover, 
the benefits are reasonably resistant to reductions in 
SAN bandwidth and variations in array cache size. Our 
hypothesis is further supported by 1.04-2.20x speedups
for most single-client real-life workloads we studied— 
and these are significantly larger than several results for 
other cache improvement algorithms. 


The TPC-H system parameters show why making array
caches more exclusive is important in large systems:
cache memory for the client and arrays represented 32% 
of the total system cost of $1.55 million [18]. The abil- 
ity to take full advantage of such large investments is a 
significant benefit; reducing their size is another. 


Using multiple clients complicates the story, and our re- 
sults are less clear-cut in such systems. Although we 
saw up to a 1.5x speedup with our exclusive caching 
schemes, we incurred a slowdown with the simple DE- 
MOTE scheme when clients shared significant parts of 
the working set. Combining adaptive cache-insertion 
algorithms with demotions yielded improvements for 
these shared workloads, but penalized disjoint work- 
loads. However, we believe that it would not be hard to 
develop an automatic technique to switch between these 
simple and adaptive modes. 


In conclusion, we suggest that the DEMOTE scheme is 
worth consideration by system designers and I/O archi- 
tects, given our generally positive results. Better yet, 
as SAN bandwidth and cache sizes increase, its ben- 
efits will likely increase, and not be wiped out by a 
few months of processor, disk, or memory technology 
progress.


Abstract


The functionality and performance innovations in file systems
and storage systems have proceeded largely independently
from each other over the past years. The result is an
information gap: neither has information about how the other
is designed or implemented, which can result in a high cost of
maintenance, poor performance, duplication of features, and
limitations on functionality. To bridge this gap, we introduce
and evaluate a new division of labor between the storage system
and the file system. We develop an enhanced storage layer
known as Exposed RAID (ExRAID), which reveals information
to file systems built above; specifically, ExRAID exports
the parallelism and failure-isolation boundaries of the storage
layer, and tracks performance and failure characteristics on a
fine-grained basis. To take advantage of the information made
available by ExRAID, we develop an Informed Log-Structured
File System (I-LFS). I-LFS is an extension of the standard log-structured
file system (LFS) that has been altered to take advantage
of the performance and failure information exposed
by ExRAID. Experiments reveal that our prototype implementation
yields benefits in the management, flexibility, reliability,
and performance of the storage system, with only a small increase
in file system complexity. For example, I-LFS/ExRAID
can incorporate new disks into the system on-the-fly, dynamically
balance workloads across the disks of the system, allow
for user control of file replication, and delay replication of files
for increased performance. Much of this functionality would
be difficult or impossible to implement with the traditional division
of labor between file systems and storage.


1 Introduction 


A chasm exists in the world of file storage and man- 
agement. Though a hierarchical file system of directo- 
ries and byte-accessible files has been the norm for al- 
most 30 years [27], the internals of file systems and un- 
derlying storage systems have evolved substantially, im- 
proving both performance [23] and functionality [33]. 

In file systems, many approaches have been developed 
to improve performance, including read-optimized inode 
and file placement [23], logging of writes [30], improved
meta-data update methods [39], more scalable internal
data structures [41], and off-line reorganization strategies
[22]. However, almost all such techniques have
been developed under the assumption that the file sys- 
tem will be run upon a single, traditional disk. 

More recently, storage systems have also received 
much attention. For example, “smart” disks can im- 
prove read or write performance with block remapping 
techniques [11, 13, 49]. For I/O-intensive workloads, 
multiple-disk storage systems have been well studied in
the research community [26, 51], and have achieved success
in the storage industry.

These high-end storage systems provide the illusion 
of a single, fast disk to unsuspecting file systems above, 
but internally manage both parallelism and redundancy
to optimize for performance, capacity, or even both [51].
Analogous to file systems, storage systems are often developed
with a single (FFS-like) file system in mind.

While these changes in both file systems and parallel 
disk systems have been substantial, they have also been 
separate, and the result is an information gap: the file 
system does not understand the true nature of the stor- 
age system it runs upon, and the storage system cannot 
comprehend the semantic relations between the blocks it
stores. In addition, each is unaware of the state the other 
tracks and the optimizations that the other performs. 

This gap arose from a historical source: the hard- 
ware/software boundary. File systems have traditionally 
expected a block-based read/write interface to storage, 
because that interface is quite similar to what a single 
disk exports. With the advent of hardware-based RAID
systems [26], storage vendors took advantage of the freedom
to innovate behind this interface, and thus developed
high-performance, high-capacity systems that appeared
as a single, large, and fast disk to the file system.
No software modifications were required of the host operating
system, and file systems continued to operate
correctly, in spite of the fact that they were often opti- 
mized for a single-disk system. In this case, ignorance 
was bliss; the arrangement was simple and worked well. 

However, the boundary between file system and storage
system is changing, migrating towards a software-structuring
technique rather than an interface necessitated
by hardware. Software RAID drivers are available
on many platforms [7], and with the advent of network-attached
storage [14], client-side striping software can
replace the need for hardware-based RAID systems en- 
tirely. Such software-based RAIDs are particularly at- 
tractive due to their low cost, e.g., in a Linux-based sys- 
tem, one incurs only the cost of the machine and disks. 

We term the arrangement of a file system layer on top 
of a software storage layer a “storage protocol stack,” 
akin to networking protocol stacks that are prominent 
in communication networks [8]. There are some simi- 
larities between the two: layering is known to simplify 
system design, though potentially at the cost of perfor- 
mance [47]. However, a crucial difference exists: the 
layers that comprise network protocol stacks are derived 
by design, with the architects carefully deciding where 
each specific element should be placed. The storage pro- 
tocol stack, however, has not been developed in a single, 
coherent manner; the end result is not only poor perfor- 
mance but also the potential for duplication in imple- 
mentation and limitations on functionality. 

For example, performance may suffer if the model 
that the file system has of the storage layer is not 
accurate; thus, layout optimizations that work well 
on a single, traditional disk may not be appropriate 
when the logical-block to physical-block mapping is unknown
[51]. Feature duplication is also a potential pitfall.
For example, a log-structured file system [30] could
be layered on top of a disk array that performs logging
[40, 51], duplicating work and increasing system
complexity unnecessarily. Finally, functionality may be 
limited, as certain pieces of information only live at one 
layer of the system. For example, the storage system 
does not know what blocks constitute a file and thus can- 
not perform per-file operations, and it does not know that 
a block is no longer live after a file deletion, and thus 
cannot optimize the system in ways possible had that 
knowledge been available. 

Thus, we believe that the time is ripe to re-examine 
the division of labor between the file system and stor- 
age system layers, in an attempt to understand the best 
way to structure the storage protocol stack. Specifically, 
for each piece of storage functionality, we wish to un- 
derstand where it is most easily and effectively imple- 
mented. We believe the problem is particularly germane 
at this time, with the move towards network-attached 
storage (and their proposed higher-level disk interfaces) 
under way [14]. 

In this paper, we take a first step towards our goal 
by exploring a single point in the spectrum of possi- 
ble designs. To bridge the file system/storage system 
information gap, we develop and evaluate a new division
of labor between the file system and storage. In this
realignment, the storage layer exposes parallelism and
failure isolation boundaries in part or full to file systems
built above, and provides on-line performance and failure
characteristics. We call this layer the Exposed RAID
layer (ExRAID).

To take advantage of the information provided by 
ExRAID, we introduce an Informed LFS (I-LFS), an
enhancement of a log-structured file system [30, 37]. 
By combining the performance and failure information 
presented by ExRAID along with file-system specific 
knowledge, I-LFS is more flexible and manageable than 
a traditional file system, and can deliver higher perfor- 
mance and availability as well. For example, adding 
a disk to I-LFS on-line is easily accomplished; further,
I-LFS accounts for the potential heterogeneity introduced
by a new disk, and dynamically balances load
across the disks of the system, whatever their rates. 
I-LFS also increases the flexibility of storage by en- 
abling user control over redundancy on a per-file ba- 
sis, and implements lazy mirroring to defer replication 
to a later time, potentially increasing performance of 
the system at a slight decrease in reliability. Crucial to 
I-LFS/ExRAID is the implementation of the aforementioned
benefits without a significant increase in overall
complexity (and thus maintainability) of the storage protocol
stack. Via careful design, all the functionality mentioned
above is implemented with only a 19% increase
in overall code size as compared to a traditional system. 

However, I-LFS/EXRAID is not a panacea. In partic- 
ular, we find that managing redundancy within the file 
system can be somewhat onerous, requiring the care- 
ful placement of inodes and data blocks to ensure ef- 
ficient operation under failure. Further, extending the 
traditional file system structure to support the enhanced 
functionality of I-LFS was sometimes an arduous task;
perhaps a redesign of the age-old vnode layer to support 
informed file systems is warranted. 

The rest of the paper is structured as follows. We be- 
gin with a discussion of related work in Section 2. In 
Section 3, we give an overview of our approach, and then 
we describe ExRAID and I-LFS in Sections 4 and 5,
respectively. Then, in Section 6, we present an evaluation
of our system. We present a discussion in Section 7, fu- 
ture work in Section 8, and conclude in Section 9. 


2 Related Work 


Part of our motivation for “informing” the file sys- 
tem of the nature of the storage system is reminiscent 
of work on the Berkeley Fast File System (FFS) [23]. 
FFS is an early demonstration of the benefits of hav- 
ing a low-level understanding of disk technology; by co- 
locating correlated inodes and data blocks, performance 
was improved, especially as compared to the old Unix 
file system. Our work has the same goal, but with multi- 
disk storage systems in mind; however, we believe that
the file system should base its decisions upon reliably-
obtained information about the characteristics of stor- 
age, instead of relying upon assumptions which may or 
may not hold across time (e.g., that seek costs dominate 
rotational costs). 

Roselli et al. discuss the file system/storage system 
gap in their talk on file system fingerprinting [29]. Their 
solution is to enrich the interface between file systems 
and storage systems, by giving the storage system more 
information about which blocks are related, and which 
blocks are likely to be accessed again in the near future. 
Thus, their approach gives the storage system some of 
the information that the file system might have collected, 
and presumes that the storage layer can make good use 
of such information. One potential problem with such an 
approach is that it may require agreement on a particular 
set of interfaces among cooperating storage vendors and 
file-system implementors. 

Another example of the benefits of low-level knowl- 
edge of disk characteristics is found in Schindler et al.’s 
recent work on track-aligned extents [36]. Therein, the
authors explore the range of performance improvements 
possible when allocating and accessing data on disk- 
track boundaries, thereby avoiding rotational latency and 
track-crossing overheads in a single-disk setting. In contrast,
ExRAID exposes disk boundaries of a RAID to
file systems above, and not such detailed lower-level in- 
formation; in the future, it would be interesting to inves- 
tigate the benefits of having lower-level knowledge of 
the specifics of a RAID-based storage system. 

Network Appliance pioneered some of the ideas we 
discuss here in their work on file server appliances [16]. 
In the development of WAFL, a write-anywhere file layout
technique, Hitz et al. hint at how some information
normally hidden inside of the RAID layer can be taken 
advantage of by a file system. For example, they ensure 
that writes to the RAID-4 layer occur in full-stripe-sized 
units, and thus avoid the small-write penalty that normally
manifests itself on RAID-4 and RAID-5 systems.
We take this a step further by formalizing the ExRAID
layer, showing that a traditional file system can easily be
modified to take advantage of the information provided
by ExRAID, and demonstrating that a broader range of
optimizations is attainable within such a framework.

Volume managers have long been used to ease the 
management of storage across multiple devices [44]. 
The ExRAID layer is simply a new type of volume
manager that exposes more information to file systems 
(specifically, on-line performance and failure informa- 
tion); further, ExRAID is built with the presupposition 
that a single mounted file system will utilize multiple 
“volumes” for its data, whereas most volume managers 
assume that there is a one-to-one mapping between each 
mounted file system and a volume. One volume manager
that is similar to ExRAID is the Pool Driver, a volume
manager for SANs that has a “sub-pool” concept which 
may be used by a file system to group related data [43]. 
In that work, the GFS file system uses sub-pools to sep- 
arate journaled meta-data and normal user data. 

Exposing each disk of a storage system to the file sys- 
tem is an extension of the arguments made by Engler and 
Kaashoek [12]. Therein, the authors argue that software 
abstractions made by operating systems are fundamen- 
tally problematic, as they are often too high-level and 
thus may limit power and functionality. The authors ad- 
vocate a solution of exposing all hardware features to the 
user. Missing from this argument for minimalism is the 
observation that hardware itself often provides abstrac- 
tions that users (and operating systems) cannot change. 
Apropos to data storage, the abstraction put forth by 
RAID systems is a particularly high-level one, which 
ExRAID breaks by revealing information that is often 
hidden from the file system. 

Some distributed file systems such as Zebra [15] and 
xFS [1] manage each disk of the system individually,
in a manner similar to I-LFS. However, both of these 
systems use traditional storage management techniques 
(such as RAID-5 striping) and do not take advantage of
the many possibilities that the ExRAID layer
makes available. In the future, we hope to extend some 
of our ideas into the distributed arena, and thus allow for 
a more direct comparison. 

More recently, the NASD object interface has been 
introduced as a higher-level data repository for SAN- 
based distributed file systems [14]. This interface allows 
more advanced functionality to be placed into the storage
layer, whereas ExRAID is designed to allow more
functionality to be placed within the file system. Earlier 
work at HP on DataMesh also proposes more sophisti- 
cated interfaces for network-attached storage [50]. 

Our informed approach is also similar to a large body 
of work in parallel file systems [17, 24]. Most parallel 
file systems expose disk parallelism, but they allow the 
application itself, and not the file system, to manage it. 
Better control over redundancy in a parallel file system 
has also been proposed [9]. In that work, the compu- 
tation of parity is put under user control, and in doing 
so, allows the user to avoid the well-known performance
penalty of RAID-4 and RAID-5 under small writes. 


3 Overview 


In the next two sections, we present the design and
implementation of ExRAID and I-LFS. Our primary
goal in designing the system is to exploit the information
made available by ExRAID, thus allowing I-LFS
to implement functionality that would be difficult or
impossible to achieve in a more traditional layering. In
particular, we aim to increase: (1) the ease of storage
management, (2) performance, especially when considering
multiple heterogeneous disks, and (3) functionality, so
as to meet the demands of a diverse set of applications.

Our primary goal in implementing ExRAID is to fa- 
cilitate the use of the information provided by ExRAID 
in the simplest possible way, and to allow non-informed 
legacy file systems to be built on top of ExRAID with
no changes. Our primary goal in implementing I-LFS 
is to minimize the impact of transforming the file system
to utilize the new storage interface. For example,
changes that would require a re-design of the vnode 
layer were ruled out, as that would mandate that all other 
file systems be changed in order to function in our sys- 
tem. Thus, throughout our implementation effort, we 
integrate changes into I-LFS in a highly localized and 
modular fashion — the fewer lines of code that changed, 
the better. 

One question that must be addressed is our decision to 
modify LFS and not a more traditional (or perhaps more 
popular) FFS-like or journaling file system. One reason 
we chose LFS is its natural flexibility in data placement; 
LFS is a modern example of a “write anywhere” storage
system [16, 19]. Write-anywhere systems provide an ex- 
tra level of indirection such that writes can be placed in 
any location on the storage medium, and we exploit this 
aspect of LFS in part of our implementation. However, 
with this in mind, we do believe that a number of our 
implementation techniques are general and could be ap- 
plied to other file systems, and hope to investigate doing 
so in the future. Those interested in general LFS file 
system performance issues should consult the work of 
Rosenblum and Ousterhout [30], or subsequent research 
by Seltzer et al. [37, 38].

All of our software was developed within the context 
of the NetBSD 1.5 operating system. ExRAID was implemented
as a set of hooks on the lower-level block-driver
calls, and is described in more detail in Section 4.
I-LFS was implemented by extending the NetBSD ver- 
sion of LFS, which is based on the original LFS for BSD 
Unix [37], and is described in detail in Section 5. We 
chose the NetBSD version of LFS as it is known to be a
relatively stable and solid implementation. 


4 ExRAID 


We now describe the ExRAID storage interface. It 
consists of two major components: a segmented address 
space which exposes some or all of the parallelism of the 
storage system to the file system, and functions used to 
inform the file system of the dynamic state of the storage 
system.

Figure 1: An Example ExRAID Configuration. The
diagram depicts an example ExRAID configuration in which
each of two disks is combined into a mirrored pair. Two
regions, each half of the size of the total address space, are
presented to the client file system. Within a region, the layout
performed by the mirror is hidden from the file system.


4.1 A Segmented Address Space 


A traditional RAID array presents the storage subsystem
to the file system as a linear array of blocks,
underneath which the true complexity of the particular
RAID scheme is hidden. File systems interact with
RAID systems by either reading or writing the blocks.
In keeping with our desire to minimize change and
preserve backwards compatibility, ExRAID also provides
a linear array of blocks which can be read or written as
the basic interface.

However, because we wish to expose information 
about the storage system to the file system, the address 
space is segmented; specifically, it is organized as a se- 
ries of contiguous regions, each of which is mapped di- 
rectly to a single disk (or set of disks), and these region 
boundaries are made known to the file system above, if 
it so desires. For example, in a four-disk storage system
with each disk capable of storing N blocks, the address
space ExRAID presents might be segmented as follows:
blocks 0 through N - 1 map to disk 0, blocks N through
2N - 1 map to disk 1, and so forth.
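
The mapping in this example can be sketched in a few lines of Python. This is only an illustration under the assumption of equal-sized, single-disk regions; the names `Location` and `locate` are hypothetical and do not appear in the prototype.

```python
from typing import NamedTuple


class Location(NamedTuple):
    disk: int    # index of the region (one disk per region here)
    offset: int  # block offset within that disk


def locate(block: int, blocks_per_disk: int, num_disks: int) -> Location:
    """Map a logical block number to (disk, offset).

    Assumes every region holds N = blocks_per_disk blocks, as in the
    four-disk example: blocks 0..N-1 on disk 0, N..2N-1 on disk 1, etc.
    """
    if not 0 <= block < blocks_per_disk * num_disks:
        raise ValueError("block lies outside the segmented address space")
    disk, offset = divmod(block, blocks_per_disk)
    return Location(disk, offset)
```

A file system that knows the region size can apply the same arithmetic in reverse to pick a logical block on a chosen disk.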

By exposing this information, ExRAID enables the 
file system to understand the performance and failure 
boundaries of the storage system. As we shall see in 
later sections, the file system can take advantage of this 
to place data on a particular region more intelligently, 
potentially improving performance, reliability, or other 
aspects of the storage system. 

Within ExRAID, a region may represent more than
just a single disk. For example, a region could be
configured to represent a mirrored pair of disks, or even a
RAID-5 collection. Thus, each region can be viewed
as a configurable software-based RAID, and the entire
ExRAID address space as a single representation of the
conglomeration of such RAID subsystems. In such a
scenario, some information is hidden from the file system,
but cross-region optimizations are still possible, if
more than one region exists. An example of an ExRAID
configuration over mirrored pairs is shown in Figure 1.

Allowing each region to represent more than just a 
single disk has two primary benefits. First, if each re- 
gion is configured as a RAID (such as a mirrored pair 
of disks), the file system is not forced to manage redun- 
dancy itself, though it can choose to do so if so desired. 
Second, this arrangement allows for backwards compatibility,
as ExRAID can be configured as a single striped,
mirrored, or RAID-5 region, thus allowing unmodified 
file systems to use it without change. 


4.2 Dynamic Information 


Although the segmented address space exposes the 
nature of the underlying disk system to the file system 
(either in part or in full), this knowledge is often not 
enough to make intelligent decisions about data placement
or replication. Thus, the ExRAID layer exposes
dynamic information about the state of each region to
the file system above, and it is in this way that ExRAID
distinguishes itself from traditional volume managers. 

Two pieces of information are needed. First, the file 
system may desire to have performance information on 
a per-region basis. The ExRAID layer tracks queue
lengths and current throughput levels, and makes these
pieces of information available to the file system.
Historical tracking of information is left to the file system.

Second, the file system may wish to know about the 
resilience of each region, i.e., when failures occur, and 
how many more failures a region can tolerate. Thus, 
ExRAID also presents this information to the file system.
For example, in Figure 1, the file system would
know that each mirror pair could tolerate a single disk 
failure, and would be informed when such a failure oc- 
curs. The file system could then take action, perhaps by 
directing subsequent writes to other regions, or even by 
moving important data from the “bad” region into other, 
more reliable portions of the ExRAID address space.
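
As a rough sketch of how a file system might consume this per-region state, consider the Python fragment below. The record layout and the names `RegionState` and `writable_regions` are assumptions made for illustration; the paper does not specify the form in which the information is delivered.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RegionState:
    queue_length: int        # outstanding requests (performance)
    throughput_mb_s: float   # current throughput (performance)
    failures_tolerable: int  # further disk failures the region survives


def writable_regions(regions: Dict[int, RegionState]) -> List[int]:
    """Return the regions to which new writes should be directed.

    Regions that can still tolerate a failure are preferred; if none
    remain, every region is returned rather than refusing writes.
    """
    healthy = [r for r in sorted(regions)
               if regions[r].failures_tolerable > 0]
    return healthy or sorted(regions)
```

With the two mirrored pairs of Figure 1, a disk failure in one pair drops its `failures_tolerable` to zero, and subsequent writes are steered to the other pair.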


4.3 Implementation 


In our current implementation, ExRAID is implemented
as a thin layer between the file system and the
storage system. In order to implement a striped, mir- 
rored, or RAID-5 region, we simply utilize the standard 
software RAID layer provided with NetBSD. However, 
our prototype ExRAID layer is not completely generalized
as of this date, and thus in its current form would
require some effort to allow a file system other than I-LFS
to utilize it. 


The segmented address space is built by interposing 
on the vnode strategy call, which allows us to remap re- 
quests from their logical block number within the virtual 
address space presented by ExRAID into a physical disk
number and block offset, which can then be issued to the
underlying disk or RAID.

Dynamic performance information is collected by 
monitoring the current performance levels of reads and 
writes. In the prototype, region boundaries, failure infor- 
mation, and performance levels (throughput and queue 
length) are tracked in the low levels of the file system.
A more complete implementation would make the information
available through an ioctl() interface to the
ExRAID device. Also note that we focus primarily on
utilizing the performance information in this paper. 


5 I-LFS 


We now describe the I-LFS file system. Our current 
design has four major pieces of additional functionality, 
as compared to the standard LFS: on-line expandability 
of the storage system, dynamic parallelism to account 
for performance heterogeneity, flexible user-managed 
redundancy, and lazy mirroring of writes. In sum total,
these added features make the system more manageable
(the administrator can easily add a new disk, without
worry of configuration), more flexible (users have
control over whether replication occurs), and higher
performing (I-LFS delivers the full bandwidth of the system
even in heterogeneous configurations, and flexible
mirroring avoids some of the costs of more rigid redundancy
schemes). For most of the discussion, we focus on
the case that most separates I-LFS/ExRAID from a
traditional RAID, where the ExRAID layer exposes each
disk of the storage system as a separate region to I-LFS.


5.1 On-Line Expansion and Contraction 


Design: The ability to upgrade a storage system in- 
crementally is crucial. As the performance or capacity 
demands of a site increase, an administrator may need 
to add more disks. Ideally, such an addition should be 
simple to perform (e.g., a single command issued by the 
administrator, or an automatic addition when the disk is 
detected by the hardware), require no down-time (thus 
keeping availability of storage high), and immediately 
make the extra performance and capacity of the new disk 
available. 

In older systems, on-line expansion is not possible. 
Even if the storage system could add a new disk on-the- 
fly, it is likely the case that an administrator would have 
to unmount the partition, expand it (perhaps with a tool 
similar to that described in [46]), and then re-mount the
file system. Worse, some systems require that a new file
system be built, forcing the administrator to restore data 
from tape. More modern volume managers [48] allow 
for on-line expansion, but still need file system support. 

Thus, our I-LFS design includes the ability to incorporate
new disks (really, new ExRAID regions) on-line
with a single command given to the file system. No
complicated support is necessitated across many layers of the
system. If the hardware supports hot-plug and detection
of new disks without a power-cycle, I-LFS can add new
disks without any down time, and thus without any reduction
in data availability. Overall, the amount of work an
administrator must put forth to expand the system is quite small.

Contraction is also important, as the removal of a region
should be as simple as the addition of one. Therefore,
we also incorporate the ability to remove a region
on the fly. Of course, if the file system has been configured
in a non-redundant manner, some data will likely be
lost. The difference between I-LFS and a traditional system
in this scenario is that I-LFS knows exactly which
files are available and can deliver them to applications.


Implementation: To allow for on-line expansion and 
contraction of storage, the file system views regions that 
have not yet been added as extant and yet fully utilized; 
thus, when a new region is added to the system, the 
blocks of that disk are made available for allocation, and
the file system will immediately begin to write data to 
them. Conversely, a region that is removed is viewed as 
fully allocated. This technique is general and could be 
applied to other file systems, and similar ideas have been 
used elsewhere [16]. 

More specifically, because a log-structured file sys- 
tem is composed of a collection of LFS segments, it 
is natural to expand capacity within I-LFS by adding 
more free segments. To implement this functionality, the 
newfs_ilfs program creates an expanded LFS seg- 
ment table for the file system. The entries in the segment 
table record the current state of each segment. When 
a new ExRAID region is added to the file system, the 
pertinent information is added to the superblock, and 
an additional portion of the segment table is activated. 
This approach limits the number of regions that can be 
added to a fixed number (currently, 16); for more flexi- 
ble growth, the segment table could be placed in its own 
file and expanded as necessary. 
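
The idea of a pre-sized segment table whose inactive portions appear fully utilized can be sketched as follows. This is a simplified model, not the I-LFS on-disk format: the class `SegmentTable`, its methods, and the assumption that every region holds the same number of segments are all hypothetical.

```python
MAX_REGIONS = 16  # fixed limit mentioned in the text

FREE, IN_USE = "free", "in_use"


class SegmentTable:
    """Segment table sized up front for MAX_REGIONS regions.

    Segments of regions that have not been added are marked IN_USE,
    so the allocator never selects them; adding a region marks its
    segments FREE, and removing it marks them IN_USE again.
    """

    def __init__(self, segs_per_region: int):
        self.segs_per_region = segs_per_region
        self.state = [IN_USE] * (MAX_REGIONS * segs_per_region)

    def _span(self, region: int) -> range:
        if not 0 <= region < MAX_REGIONS:
            raise ValueError("segment table has no room for this region")
        start = region * self.segs_per_region
        return range(start, start + self.segs_per_region)

    def add_region(self, region: int) -> None:
        for seg in self._span(region):
            self.state[seg] = FREE

    def remove_region(self, region: int) -> None:
        for seg in self._span(region):
            self.state[seg] = IN_USE

    def free_segments(self) -> list:
        return [i for i, s in enumerate(self.state) if s == FREE]
```

Because the rest of the allocator only ever consults the per-segment state, it needs no knowledge of which regions exist.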


5.2 Dynamic Parallelism


Design: One problem introduced by the flexibility an 
administrator has in growing a system is the increased 
potential for performance heterogeneity in the disk subsystem;
in particular, a new disk or ExRAID segment
may have different performance characteristics than the
other disks of the system. In such a case, traditional
striping and RAID schemes do not work well, as they
all assume that disks run at identical rates [4, 10].

Traditionally, the presence of multiple disks is hidden 
by the storage layer from the file system. Thus, current 
systems must handle any disk performance heterogene- 
ity in the storage layer — the file system does not have 
enough information to do so itself. The research com- 
munity has proposed schemes to deal with static disk 
heterogeneity [3, 10, 32, 52], though many of these so- 
lutions require careful tuning by an administrator. As 
Van Jacobson notes, “Experience shows that anything
that needs to be configured will be misconfigured” [18].

Further complicating the issue is that the delivered 
performance of a device could change over time. Such 
changes could result from workload imbalances, or perhaps
from the “fail-stutter” nature of modern devices,
which may present correct operation but degraded per- 
formance to clients [5]. Even if more advanced hetero- 
geneous data layout schemes are utilized, they will not 
work well under dynamic shifts in performance. 

To handle such static and dynamic performance dif- 
ferences among disks, we include a dynamic segment 
placement mechanism within I-LFS [4]. A segment can 
logically be written to any free space in the file system; 
we exploit this by writing segments to ExRAID regions
in proportion to their current rate of performance, ex- 
ploiting the dynamic state presented to the file system by 
ExRAID. By doing so, we can dynamically balance the 
write load of the system to account for static or dynamic 
heterogeneity in the disk subsystem. Note that if perfor- 
mance of the disks is roughly equivalent, this dynamic 
scheme will degenerate to standard RAID-0 striping of
segments across disks. 
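
One simple way to write segments in proportion to each region's rate is sketched below. This is an illustrative policy, not necessarily the one used in I-LFS: the functions `pick_region` and `place_segments` are hypothetical, and the rates are held fixed here whereas the real system would re-read them from ExRAID as they change.

```python
def pick_region(rates, written):
    """Pick the region whose segment count lags its rate the most.

    rates:   region -> current throughput reported by the storage layer
    written: region -> number of segments already placed there
    """
    candidates = [r for r in sorted(rates) if rates[r] > 0]
    if not candidates:
        raise RuntimeError("no region is currently delivering throughput")
    # Choosing the smallest (written + 1) / rate keeps the number of
    # segments per region proportional to the region's rate.
    return min(candidates, key=lambda r: (written.get(r, 0) + 1) / rates[r])


def place_segments(rates, count):
    """Place `count` segments and return the sequence of regions used."""
    written = {}
    order = []
    for _ in range(count):
        region = pick_region(rates, written)
        written[region] = written.get(region, 0) + 1
        order.append(region)
    return order
```

With equal rates the sequence is a plain round-robin over the regions, which matches the observation that the scheme degenerates to striping; a region running twice as fast as another receives twice as many segments.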

This style of dynamic placement could also be per- 
formed in a more traditional storage system (e.g., Au- 
toRAID has the basic mechanisms in place to do 
so [51]). However, doing so unduly adds complexity into
the system, as both the file system and the storage sys- 
tem have to track where blocks are placed; by pushing 
dynamic segment placement into the file system, overall 
complexity is reduced, as the file system already tracks 
where the blocks of a file are located. 


Implementation: The original version of LFS allocates 
segments sequentially based on availability; in other 
words, all free segments are treated equally. To better 
manage parallelism among disks in I-LFS, we develop a 
segment indirection technique. Specifically, we modify 
the ilfs_newseg() routine to invoke a data placement
strategy. The ilfs_newseg() routine is used
to find the next free segment to write to; here, we alter
it to be “region aware”, and thus allow for a more
informed segment-placement decision. By choosing disks
in accordance with their performance levels (information
provided by ExRAID), the load across a set of
heterogeneously-performing regions can be balanced.

The major advantage of our decision to implement 
this functionality within the ilfs_newseg() routine
is that it localizes the knowledge of multiple disks to 
a very small portion of the file system; the vast major- 
ity of code in the file system is not aware of the region 
boundaries within the disk address space, and thus re- 
mains unchanged. The slight drawback is that the deci- 
sion of which region to place a segment upon is made 
early, before the segment has been written to; if the
performance level of the disk changes significantly as the
segment fills, the placement decision could potentially
be a poor one. In practice, we have not found this
to be a performance problem. 


5.3 Flexible Redundancy 


Design: Typically, redundancy is implemented in a 
one-size-fits-all manner, as a single RAID scheme (or 
two, as in AutoRAID) is applied to all the blocks of the
storage system. The file system is typically neither in- 
volved nor aware of the details of data replication within 
the storage layer. This traditional approach is limiting, 
as much semantic information is available in the file 
system as well as in smart users or applications, which 
could be exploited to improve performance or better uti- 
lize capacity. 

Thus, in I-LFS, we explore the management of redun- 
dancy strictly within the file system, as managing redun- 
dancy in the file system provides greater flexibility and 
control to users. In our current design, we allow users or 
applications to select whether a file should be made re- 
dundant (in particular, if it should be mirrored). If a file 
is mirrored, users pay the cost in terms of performance 
and capacity. If a file is not mirrored, performance in- 
creases during writes to that file, and capacity is saved, 
but the chances of losing the file are increased. Turning 
off redundancy is thus well-suited for temporary files, 
files that can easily be regenerated, or swap files. 

Because I-LFS performs the replication, better accounting
is also possible, as the system knows exactly
which files (and hence which users) are using which 
physical blocks. In contrast, with a traditional file sys- 
tem mounted on top of an advanced storage system such 
as AutoRAID [51], users are charged based on the log- 
ical capacity they are using, whereas the true usage of 
storage depends on access patterns and usage frequency. 

Because redundancy schemes are usually imple- 
mented within the RAID storage system (where no no- 
tion of a file exists), our scheme would not easily be im- 
plemented in a traditionally-layered system. The storage 
system is wholly unaware of which blocks constitute a 
file and therefore cannot receive input from a user as to 
which blocks to replicate; only if both the file system
and storage system were altered could such functionality
be realized. In the future, it would be interesting
to investigate a range of policies on top of our redundancy
mechanisms that automatically apply different
redundancy strategies according to the class of a file, akin
to how the Elephant file system segregates files for
different versioning techniques [33].

Figure 2: The “Crossed Pointer” Problem. The figure
illustrates the problem with using a separate file as a means
for redundancy. Specifically, even though each element of a
file (inode, data block) has been replicated, a single lost disk
could still make it difficult to find a particular data block, due
to the extra requirement that for each block, a pointer chain to
the block must still be live. In the example, the file with inode
number N and its mirror, inode N + 1, consist of a single data
block (block 0). If either disk crashes, it is not possible to find
the corresponding data block, even though a copy of it exists
on the remaining working disk.


Implementation: To accomplish our goal of per-file 
redundancy, we decided to utilize separate and unique 
meta-data for original and redundant files. This ap- 
proach is natural within the file system as it does not 
require changes to on-disk data structures. 

In our implementation, we use a straight-forward 
scheme that assigns even inode numbers to original files 
and odd inode numbers to their redundant copies. This 
method has several advantages. Because the original 
and redundant files have unique inodes, the data blocks 
can be distributed arbitrarily across disks (given certain 
constraints described below), thus allowing us to use re- 
dundancy in combination with our other file system fea- 
tures. Also, the number of LFS inodes is unlimited be- 
cause they are written to the log, and the inode map is 
stored in a regular file which is expanded as necessary. 
The prime disadvantage of our approach is that it limits
redundancy to one copy, but this could easily be
extended to an N-way mirroring scheme by reserving N
i-numbers per file.
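
The even/odd numbering, and the suggested N-way extension, amount to simple arithmetic on i-numbers. The helper names below are hypothetical, and the `copies` function is only one assumed way to reserve N consecutive i-numbers per file; the text does not prescribe a particular layout.

```python
def is_primary(ino: int) -> bool:
    """Even i-numbers denote original files."""
    return ino % 2 == 0


def mirror_of(ino: int) -> int:
    """Return the i-number of the redundant copy of a primary file."""
    if not is_primary(ino):
        raise ValueError("only primary (even) i-numbers have a mirror")
    return ino + 1


def primary_of(ino: int) -> int:
    """Return the primary i-number for either copy of a file."""
    return ino - (ino % 2)


def copies(ino: int, n: int) -> list:
    """Assumed N-way generalization: n consecutive i-numbers per file."""
    base = ino - (ino % n)
    return [base + k for k in range(n)]
```

With `n = 2` the generalized form reduces to the even/odd scheme described above.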

One problem introduced by our decision to utilize 
separate inodes to track the primary and mirrored copy 
of a file is what we refer to as the “crossed pointer”
problem. Figure 2 illustrates the difficulty that can arise.
Simply requiring each component of a file (e.g., the inode,
indirect blocks, and data blocks) be replicated is not
sufficient to guarantee that all data can be recovered eas- 
ily under a single disk failure. Instead, we must ensure 
that each data block is reachable under a disk failure; a 
block being reachable implies that a pointer chain to it 
exists. 

Consider the example in the figure: a file with inode
number N is replicated within inode number N + 1.
Inode N is located on the first disk, as is the first data
block of the mirror copy (file N + 1). Inode N + 1 is
on the other disk, as is the first data block of the primary
copy (file N). However, if either disk fails, the first data
block is not easily recovered, as the inode on the sur- 
viving disk points to the data block on the failed disk. 
In some file systems, this would be a fatal flaw, as the 
data block would be unrecoverable. In LFS, it is only a 
performance issue, as the extra information found within 
segment summary blocks allows for full recovery; how- 
ever, a disk crash would mandate a full scan of the disk 
to recover all data blocks. 

There are a number of possible remedies to the prob- 
lem. For example, one could perform an explicit repli- 
cation of each inode and all other pointer-carrying struc- 
tures, such as indirect blocks, doubly-indirect blocks, 
and so forth. However, this would require the on-disk 
format to change, and would be inefficient in its usage 
of disk space, as each inode and indirect block would 
have four logical copies in the file system. 

Instead, we take a much simpler approach of divide 
and conquer. The disks of the system are divided into 
two sets. When writing a redundant file to disk, I-LFS 
decides which set the primary copy should be placed 
within; the redundant copy is placed within the other set. 
Thus, because no pointers cross from either set into the 
other, we can guarantee that a single failure will cause 
no harm (in fact, we can tolerate any number of failures 
to disks in that set). 
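
The reachability argument can be checked with a small model. In the sketch below, a copy of a file is described by the disk holding its inode and the disks holding its blocks; the dictionary layout and the names `reachable`, `file_survives`, and `place_mirrored` are assumptions made for illustration. "Reachable" here means that a live pointer chain exists, so no full scan of segment summaries is needed.

```python
def reachable(copy, failed):
    """A copy is usable if its inode and all its blocks avoid failed disks."""
    return (copy["inode"] not in failed
            and all(d not in failed for d in copy["blocks"]))


def file_survives(copies, failed):
    """True if at least one copy still has a live pointer chain."""
    return any(reachable(c, failed) for c in copies)


def place_mirrored(set_a, set_b, nblocks):
    """Lay out the primary entirely in set_a and the mirror in set_b."""
    def lay_out(disks):
        disks = sorted(disks)
        return {"inode": disks[0],
                "blocks": [disks[i % len(disks)] for i in range(nblocks)]}
    return [lay_out(set_a), lay_out(set_b)]


# The layout of Figure 2: each inode points to a block on the other disk.
crossed = [{"inode": 0, "blocks": [1]}, {"inode": 1, "blocks": [0]}]
```

In the crossed layout, losing either disk breaks both pointer chains; with the two-set layout, any number of failures confined to one set leaves the other copy intact.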

Finally, incorporating redundancy into I-LFS also 
presents us with a difficult implementation challenge: 
how should we replicate the data and inodes within the 
file system, without re-writing every routine that cre- 
ates or modifies data on disk? We develop and apply 
recursive vnode invocation to ease the task. We em- 
bellish most I-LFS vnode operations with a short re- 
cursive tail; therein, the routine is invoked recursively 
(with appropriate arguments) if the routine is currently 
operating on an even i-number and therefore on the 
primary copy of the data, and if the file is designated 
for redundancy by the user. For instance, when a file
is created using ilfs_create(), a recursive call to
ilfs_create() is used to create a redundant file. The
recursion is broken within the call to perform the identical
operation to the redundant file.
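
The shape of such a recursive tail can be illustrated with a toy model. The class `ToyFS` below is hypothetical and stands in for the real vnode operations only in structure: the operation does its normal work, then re-invokes itself on the odd i-number if it was operating on an even one and the file is marked redundant.

```python
class ToyFS:
    """Toy file system illustrating a recursive tail on create."""

    def __init__(self):
        self.files = {}        # i-number -> file name
        self.next_primary = 2  # primaries receive even i-numbers

    def create(self, name, redundant, ino=None):
        if ino is None:
            ino = self.next_primary
            self.next_primary += 2
        self.files[ino] = name  # the ordinary work of the operation

        # Recursive tail: taken only for a primary (even) i-number of a
        # file designated redundant. The nested call runs on the odd
        # i-number, fails this test, and so ends the recursion.
        if ino % 2 == 0 and redundant:
            self.create(name, redundant, ino + 1)
        return ino
```

Since the tail is the same few lines in every operation, the body of each routine is left untouched.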


5.4 Lazy Mirroring


Design: User-controlled replication allows users to
control if replication occurs, but not when. As has been 
shown in previous work, many potential benefits arise in 
allowing flexible control over when redundant copies are 
made or parity is updated [9]. Delaying parity updates 
has been shown to be beneficial in RAID-5 schemes to 
avoid the small-write problem [34], and could also re- 
duce load under mirrored schemes. Implementing such 
a feature at the file system level allows the user to de- 
cide the window of vulnerability for each file, as losing 
data in certain files may likely be more tolerable than 
in others. Note that either of these enhancements would 
be difficult to implement in a traditional system, as the 
information required resides in both the file system and 
RAID, necessitating non-trivial changes to both. 

In I-LFS, we incorporate lazy mirroring into our user- 
controlled replication scheme. Thus, users can desig- 
nate a file as non-replicated, immediately replicated, or 
lazily replicated. By choosing a lazy replica, the user is 
willing to increase the chance of data loss for improved 
performance. Lazy mirroring can improve performance 
for one of two reasons. First, by delaying file replica- 
tion, the file system may reduce load under a burst of 
traffic and defer the work of replication to a later pe- 
riod of lower system load. Second, if a file is written to 
disk and then deleted before the replication occurs, the 
cost of replication is removed entirely. As most systems 
buffer files in memory for a short period of time (e.g., 30 
seconds), and file lifetimes have recently been shown to 
be longer than this on average [28], this second scenario 
may be more common than previously thought. 


Implementation: Lazy mirroring is implemented in 
I-LFS as an embellishment to the file-system cleaner. 
For files that are designated as lazy replicas, an extra 
bit is set in the segment usage table indicating their sta- 
tus. When the cleaner scans a segment and finds blocks 
that need to be replicated, it simply performs the repli- 
cation directly, making sure to place replicated blocks so 
as to avoid the “crossed pointer” problem, and associates 
them with the mirrored inode. When the replication is 
complete, the bit is cleared. Currently, the file system 
replicates files after a 2-minute delay, though in the fu- 
ture this could be set directly by the user or application. 
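
The cleaner-driven scheme can be sketched as follows. This is a simplified model under stated assumptions, not the I-LFS cleaner: the `SegmentUsage` record, its `write_time` field, and the caller-supplied `replicate` routine are hypothetical names, and placement concerns such as the crossed-pointer rule are left out.

```python
from dataclasses import dataclass

REPLICATION_DELAY = 120.0  # seconds; the 2-minute default described above


@dataclass
class SegmentUsage:
    segno: int
    write_time: float            # when the segment was written (assumed field)
    needs_replica: bool = False  # the extra "lazy replica" bit


def cleaner_pass(table, now, replicate, delay=REPLICATION_DELAY):
    """Replicate every flagged segment older than `delay`; return their numbers."""
    done = []
    for seg in table:
        if seg.needs_replica and now - seg.write_time >= delay:
            replicate(seg.segno)       # caller-supplied copy routine
            seg.needs_replica = False  # clear the bit once the copy exists
            done.append(seg.segno)
    return done
```

Because the flag lives in a table the cleaner already walks, a file deleted before its delay expires never reaches `replicate`, which is where the saving described in the design comes from.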


6 Evaluation 


In this section, we present an evaluation of ExRAID
and I-LFS. Experiments are performed upon an Intel-based
PC with 128 MB of physical memory. The main
processor is a 1-GHz Intel Pentium III Xeon, and the
system houses four 10,000 RPM Seagate ST318305LC
Cheetah 36XL disks (which we will refer to as the “fast”
disks), and four 7,200 RPM Seagate ST34572W Barracuda
4XL disks (the “slow” disks). The fast disks can
deliver data at roughly 21.6 MB/s each, and the slow
disks at approximately 7.5 MB/s apiece. For all
experiments, we perform 30 trials and show both the average
and standard deviation.


Figure 3: Baseline Performance Comparison. The figure
plots the performance of I-LFS/ExRAID under sequential
writes, sequential reads, random writes, and random reads.
The tests are run on four disks, varying whether the disks used
are the four slow disks or the four fast ones. In all cases,
requests generated by the tests are 8 KB in size, and the total
data-set size is 200 MB.

In some experiments, we compare the performance
of I-LFS/ExRAID to standard RAID-0 striping. Stripe
sizes are chosen so as to maximize performance of the
RAID-0 given the workload at hand, making the
comparison as fair as possible, or even slightly unfair
towards I-LFS/ExRAID.


6.1 Baseline Performance 


In this first experiment, we demonstrate the baseline 
performance of I-LFS/ExRAID on top of two different
homogeneous storage configurations, one with four slow 
disks, and one with four fast disks. The experiment con- 
sists of sequential write, sequential read, random write, 
and random read phases (based on patterns generated 
by the Bonnie [6] and IOzone [25] benchmarks). We 
perform this experiment to demonstrate that there is no 
unexpected overhead in our implementation, and that it 
scales to higher-performance disks effectively. 

As we can see in Figure 3, sequential write, sequential 
read, and random writes all perform excellently, achiev- 
ing high bandwidth across both disk configurations. Not 
surprisingly for a log-based file system, random reads 
perform much more poorly, achieving roughly 0.9 MB/s 
on the four slow disks, and 1.8 MB/s on the four fast 
disks, in line with what one would expect from these 
disks in a typical RAID configuration.


Figure 4: Storage Expansion. The graph plots the
performance of I-LFS during storage expansion. The experiment
begins with I-LFS writing to a single disk. Each time 256 MB is
written, a new disk is brought on-line, and I-LFS immediately
begins writing to it for increased performance. Disk expansion 
is accomplished via a simple command, which adds the disk 
(or region) to the file system without down time. 


6.2 On-line Expansion 


We now demonstrate the performance of the system
under writes as disks are added to the system on-line. In 
this experiment, the disks are already present within the 
PC, and thus the expansion stresses the software infras- 
tructure and not hardware capabilities. 

Figure 4 plots the performance of sequential writes 
over time as disks are added to the system.¹ Along the
x-axis, the amount of data written to disk is shown, and
the y-axis plots the rate at which the most recent 64 MB
was committed to disk. As one can see from the graph, I-LFS
immediately starts using the disks for write traffic as they 
are added to the system. However, read traffic will con- 
tinue to be directed to the original disks for older data. 
The LFS cleaner could redistribute existing data over the 
newly-added disks, either explicitly or through cleaning, 
but we have not yet explored this possibility. 


6.3 Dynamic Parallelism 


We next explore the ability of I-LFS to place segments 
dynamically in different regions based on the current 
performance characteristics of the system, in order to 
demonstrate the ability of I-LFS to react to static and 
dynamic performance differences across devices. 

There are many reasons for performance variation
among drives. For example, when new disks are added,
they are likely to be faster than older ones; further,
unexpected dynamic performance variations due to bad-block
remapping or “hot spots” in the workload are not
uncommon [5], and therefore can also lead to performance
heterogeneity across disks. Indeed, the ability to
expand the disk system on-line (as shown above) induces a
workload imbalance, as read traffic is not directed to the
newly-added disks until the cleaner has reorganized data
across all of the disks in the system.


¹ Random writes perform similarly, due to the nature of LFS.


Figure 5: Static Storage Heterogeneity. The figure
plots the performance of I-LFS versus FFS/CCD with standard
RAID-0 striping, both under a series of disk configurations.
Along the x-axis, the number of fast and slow disks is varied
(f:s implies f fast disks and s slow ones). By adjusting where
segments are written dynamically, I-LFS/ExRAID is able to
deliver the full bandwidth of the disks. In contrast, standard
striping performs at the rate of the slowest disk in the system.
For each test, 200 MB is written to disk.

We experiment with both static and dynamic perfor- 
mance variations in this subsection. Figure 5 shows the 
results of our static heterogeneity test. The sequential 
write performance of I-LFS with its dynamic segment 
placement scheme is plotted along with FFS on top of 
the NetBSD concatenated disk driver (CCD) configured 
to stripe data in a RAID-0 fashion. In all experiments,
data is written to four disks. Along the x-axis, we
increase the number of slow disks in the system; thus, at
the extreme left, all of the four disks are fast ones, at 
the right they are all slow ones, and in the middle are 
different heterogeneous configurations. 

As we can see in the figure, by writing segments dy- 
namically in proportion to delivered disk performance, 
I-LFS/ExRAID is able to deliver the full bandwidth of
the underlying storage system to applications; overall
performance degrades gracefully as more slow disks
replace fast ones in the storage system. RAID-0 striping
performs at the rate of the slowest disk, and thus per- 
forms poorly in any heterogeneous configuration. 
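
The contrast between the two policies can be illustrated with a toy model. The sketch below is not the I-LFS placement code: it assumes a greedy rule (send each segment to the disk that would finish it soonest) as one plausible way to write "in proportion to delivered disk performance", and the function names and the 1 MB segment size are hypothetical.

```python
def striped_bandwidth(rates):
    """Striping gives every disk equal data, so the slowest disk paces all."""
    return len(rates) * min(rates)


def place_segments(rates, n_segments, seg_mb=1.0):
    """Greedy placement: each segment goes to the disk that finishes it soonest.

    rates are per-disk bandwidths in MB/s. Returns (segments per disk,
    aggregate MB/s over the whole run).
    """
    busy_until = [0.0] * len(rates)  # time at which each disk becomes idle
    counts = [0] * len(rates)
    for _ in range(n_segments):
        k = min(range(len(rates)),
                key=lambda d: busy_until[d] + seg_mb / rates[d])
        busy_until[k] += seg_mb / rates[k]
        counts[k] += 1
    elapsed = max(busy_until)
    return counts, n_segments * seg_mb / elapsed
```

With two fast and two slow disks at the rates quoted in this section (21.6 and 7.5 MB/s), the striped model is capped at four times the slow rate, while the greedy model assigns more segments to the fast disks and approaches the sum of the individual rates. The model ignores seeks, caching, and file-system overhead, so it shows the trend only, not the measured numbers.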

We also perform a “misconfiguration” test. In this
experiment, we configure the storage system to utilize two
partitions on the same disk, emulating a misconfiguration
by an administrator (similar in spirit to tests performed
by Brown and Patterson [7]). Thus, while the
disk system appears to contain four separate disks, it
really only contains three. In this case, I-LFS/ExRAID
writes data to disk at 65 MB/s, whereas standard striping
delivers only 46 MB/s. The dynamic segment striping
of I-LFS is successfully able to balance load across the
disks, in this case properly assigning less load to each
partition within the accidentally over-burdened disk.


Figure 6: Dynamic Storage Heterogeneity. The figure
plots the performance of I-LFS/ExRAID and FFS/CCD under
a dynamic performance variation. During the experiment,
the performance of a single disk is temporarily degraded; the
faulty disk delays requests for a fixed time, reducing throughput
of the disk from 21.6 MB/s to 5.8 MB/s. By adaptively
writing more data to the other disks, I-LFS/ExRAID with
dynamic segment placement is better able to adjust to the
imbalance and deliver higher throughput.

In our final heterogeneity experiment, we introduce
an artificial “performance fault” into a storage system
consisting of four fast disks, in order to confirm that our
load balancing works well in the face of dynamic
performance variations. Figure 6 shows the performance
during a write of both I-LFS/ExRAID with dynamic segment
placement and FFS/CCD using RAID-0 striping in
a case where a single disk of the four exhibits a
performance degradation. After one third of the data is written,
a kernel-based utility is used to temporarily delay
completed requests from one of the disks. The delay has
the effect of reducing its throughput from 21.6 MB/s to
5.8 MB/s. The impaired disk is returned to normal
operation after an additional one third of the data is written.
As we can see from the figure, I-LFS/ExRAID does a
better job of tolerating the fluctuations induced during
the second phase of the experiment, improving
performance by over a factor of two as compared to FFS/CCD.


6.4 Flexible Redundancy 


In our first redundancy experiment, we verify the
operation of our system in the face of failure. Figure 7
plots the performance of a set of processes performing
random reads from redundant files on I-LFS. Initially,
the bandwidth of all four disks is utilized by balancing
the read load across the mirrored copies of the data. As
the experiment progresses, a disk failure is simulated by
disabling reads to one of the disks. I-LFS continues
providing data from the available replicas, but overall
performance is reduced.


Figure 7: Storage Failure. The figure plots the random
read performance to a set of mirrored files across four disks on
I-LFS. At the labeled points in the graph, a disk is taken
off-line, and performance decreases because I-LFS can no longer
balance the read load between the replicas. Note that in this
example, I-LFS/ExRAID can survive any single disk failure;
however, after the first failure, I-LFS/ExRAID can only
tolerate the loss of the other disk in the set.

Next, we demonstrate the flexibility of per-file redun- 
dancy when the redundancy is managed by the file sys- 
tem. A total of 20 files are written concurrently to a 
system consisting of four fast disks, while the percent- 
age of those files that are mirrored is increased along the 
x-axis. The results are shown in Figure 8. 

As expected, the net throughput of the system de- 
creases linearly as more files are mirrored, and when 
all are mirrored, overall throughput is roughly halved. 
Thus, with per-file redundancy, users “get what they pay
for”; if users want a file to be redundant, the performance
cost of replication is paid during the write, and if not, 
the performance of the write reflects the full bandwidth 
of the underlying disks. 


6.5 Lazy Mirroring 


In our final experiment, we demonstrate some of the 
performance characteristics of lazy mirroring. Figure 9
plots the write performance to a set of lazily mirrored 
files. After a delay of 20 seconds, the cleaner begins 
replicating data, and the normal file system traffic suf- 
fers from a small decline in performance. The default 
replication delay for the system is two minutes in length, 
but an abbreviated delay is used here to reduce the time 
of the experiments. 

From the figure, we can see the potential benefits of
lazy mirroring, as well as its potential costs. If lazily
mirrored files are indeed deleted before replication
begins, the full throughput of the storage layer will be
realized. However, if many or all lazily mirrored files are
not deleted before replication, the system incurs an extra
penalty, as those files must be read back from disk and
then replicated, which will affect subsequent file system
traffic. Therefore, lazy mirroring should be used
carefully, either in systems with highly bursty traffic (i.e.,
with idle time for the lazy replicas to be created), or with
files that are easily distinguishable as short-lived.


Figure 8: Per-file Redundancy. The figure plots the
performance of writes to 20 separate files as the percent of
those files that are mirrored increases. As more files are
mirrored, the net bandwidth of the system drops to roughly half
of its peak rate, as expected. The peak bandwidth achieved
is lower than in the previous experiments due to the increased
number of files and subsequent meta-data operations. In each
experiment, 200 MB is written out to disk.


7 Discussion 


In implementing I-LFS/ExRAID, we were concerned
that by pushing more functionality into the file system, 
the code would become unmanageably complex. Thus, 
one of our primary goals is to minimize code complex- 
ity. We believe we achieve this goal, integrating the three 
major pieces of functionality with only an additional 
1500 lines of code, a 19% increase over the original 
size of the LFS implementation. Of this additional code, 
roughly half is due to the redundancy management. 

From the design standpoint, we find that managing
redundancy within the file system has many benefits,
but also causes many difficulties. For example, to solve
the crossed-pointer problem, we applied a
divide-and-conquer technique. By placing the primary copy of a
file into one of two sets, and its mirror in the other, we
enable fast operation under failure. However, our
solution limits data placement flexibility, in that once a
file is assigned to a set, any subsequent writes to that
file must be written to that set. This limitation affects
performance, particularly under heterogeneous
configurations where one set has significantly different
performance characteristics than the other. Though we can
relax these placement restrictions, e.g., by choosing which
disks constitute a set on a per-file basis, the problem is
fundamental to our approach to file-system management
of redundancy.


Figure 9: Lazy Mirroring. The figure plots the write
performance to a set of lazy redundant files on I-LFS with a
replication delay of 20 seconds. Peak performance is achieved
during the initial portion of the test, but performance is
reduced slightly as the cleaner begins replicating data. After the
write test completes, the cleaner continues to replicate data in
the background.

From the implementation standpoint, file-system 
managed redundancy is also problematic, in that the
vnode layer is designed with a single underlying disk in
mind. Though our recursive invocation technique was 
successful, it stretched the limits of what was possible in 
the current framework, and new additions or modifica- 
tions to the code are not always straightforward to imple- 
ment. To truly support file-system managed redundancy, 
a redesign of the vnode layer may be beneficial [31]. 


8 Future Work 


A number of possible avenues exist for future re- 
search. Most generally, we believe more organiza- 
tions of the storage protocol stack need to be explored. 
Which pieces of functionality should be implemented 
where, and what are the trade-offs? One natural follow- 
on is to incorporate more lower-level information into 
ExRAID; the main challenge when exposing new in- 
formation to the file system is to find useful pieces of 
information that the file system can readily exploit. 

Of course, most file service today spans client and 
server machines. Thus, we believe it is important to con- 
sider how functionality should be split across machines. 
Which portion of the traditional storage protocol stack 
should reside on clients, and which portion should reside 
on the servers? Researchers in distributed file systems
have taken opposing points of view on this, with systems
such as Zebra [15] and xFS [1] letting clients do most 
of the work, whereas the Frangipani/Petal system places 
most functionality within the storage servers [21, 45]. 


We also believe cooperative approaches between the 
file system and storage system may be useful. For ex- 
ample, we found that implementing redundancy in the 
file system was sometimes vexing; perhaps an approach 
that shared the responsibility of redundancy across both 
file system and storage layer would be an improvement. 
For example, the storage layer could tell the file system 
which block to use as a mirror of another block, but the 
file system could decide when to perform the replication. 


Even if we decide upon a new storage interface, it 
may be difficult to convince storage vendors to move 
away from the tried-and-true standard SCSI interface to 
storage. Thus, a more pragmatic approach may be to 
treat the RAID layer as a gray box, inferring its charac- 
teristics and then exploiting them in the file system, all 
without modification of the underlying RAID layer [2]. 
Tools that automatically extract low-level information 
from disk drives, such as DIXtrac [35] and SKIPPY [42], 
are first steps towards this goal, with extensions needed 
to understand the parallel aspects of storage systems. 


Finally, we envision many more possible optimizations
in our new arrangement of the storage protocol
stack. For example, we are currently exploring the
notion of intelligent reconstruction. The basic idea is
simple: if a disk (or region) fails, and I-LFS has duplicated
the data upon that disk, I-LFS can begin the
reconstruction process itself. The key difference is that I-LFS will
only reconstruct live data from that disk, and not the
entire disk blindly, as a storage system would, substantially
lowering the time to perform the operation. A fringe
benefit of intelligent reconstruction is that I-LFS should
be able to give preference to certain files over others,
reconstructing higher-priority files first and thus increasing
the availability of those files under failure.
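
A minimal sketch of the ordering idea follows. It is hypothetical and not taken from I-LFS: the `(address, owner)` block map and the numeric `priority` table are assumptions chosen to make the two properties named above visible, namely that dead space is skipped and that higher-priority files are rebuilt first.

```python
def reconstruction_plan(blocks, priority):
    """Return live block addresses to rebuild, highest-priority files first.

    blocks:   iterable of (address, owner) pairs for the failed disk, where
              owner is a file name, or None for dead (unreferenced) space.
    priority: dict mapping file name -> number; larger means rebuild sooner.
              Files missing from the dict are treated as priority 0.
    """
    live = [(addr, owner) for addr, owner in blocks if owner is not None]
    # Order by descending file priority, then by ascending disk address.
    live.sort(key=lambda b: (-priority.get(b[1], 0), b[0]))
    return [addr for addr, _ in live]
```

A block-level array has no notion of `owner`, so it can neither skip the dead blocks nor reorder the live ones; both decisions depend on information that only the file system holds.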


We also imagine that many optimizations are possible 
with the LFS cleaner. For example, as data is laid out 
on disk according to current performance characteristics 
and access patterns, it may not meet the needs of subse- 
quent potentially non-sequential reads from other appli- 
cations. Similarly, as new disks are added, the cleaner 
may want to run in order to lay out older data across 
the new disks. Thus, the cleaner could be used to re- 
organize data across drives for better read performance 
in the presence of heterogeneity and new drives, similar 
to the work of Neefe et al., but generalized to operate in 
a heterogeneous multi-disk setting [22].


9 Conclusions


In terms of abstractions, block-level storage systems
such as SCSI have been quite successful: disks hide
low-level details from file systems such as the exact
mechanics of arm movement and head positioning, but still
export a simple performance model upon which file
systems could optimize. As Lampson said: “[...] an
interface can combine simplicity, flexibility, and high
performance together by solving one problem and leaving the
rest to the client” [20]. In early single-disk systems, this
balance was struck nearly perfectly.

As storage systems evolved from a single drive into a 
RAID with multiple disks, the interface remained sim- 
ple, but the RAID itself did not. The result is a system 
full of misinformation: the file system no longer has an 
accurate model of disk behavior, and the now-complex 
storage system does not have a good understanding of 
what to expect from the file system. 

ExRAID and I-LFS bridge this information gap by
design: the presence of multiple regions is exposed di- 
rectly to the file system, enabling new functionality. 
In this paper, we have explored the implementation of 
on-line expansion, dynamic parallelism, flexible
redundancy, and lazy mirroring in I-LFS. All were
implemented in a relatively straightforward manner within
the file system, increasing system manageability, perfor- 
mance, and functionality, while maintaining a reason- 
able level of overall system complexity. Some of these 
aspects of I-LFS would be difficult if not impossible to 
build in the traditional storage protocol stack, highlight- 
ing the importance of implementing functionality in the 
correct layer of the system. 

Though we have chosen a single point in the design 
space of storage protocol stacks, other arrangements are 
possible and perhaps even preferable; we hope that they 
will be explored. Whatever the conclusion of research
on the division of labor between file and storage
systems, we believe that the proper division should be
arrived at via design, implementation, and thorough
experimentation, not via historical artifact.


Abstract


In a system offering on-demand real-time streaming
of media files, data striping across an array of disks
can improve load balancing, allowing higher disk
utilization and increased system throughput. However,
it can also cause complete service disruption in the
case of a disk failure. Reliability can be improved
by adding data redundancy and reserving extra disk
bandwidth during normal operation. In this paper,
we are interested in providing fault-tolerance for
media servers that support variable bit-rate encoding
formats. Higher compression efficiency with respect
to constant bit-rate encoding can significantly reduce
per-user resource requirements, at the cost of
increased resource management complexity. For the
first time, the interaction between storage system
fault-tolerance and variable bit-rate streaming with
deterministic QoS guarantees is investigated. We
implement into a prototype server and experimentally
evaluate, using detailed simulated disk models,
alternative data replication techniques and disk
bandwidth reservation schemes. We show that with the
minimum reservation scheme introduced here, single
disk failures can be tolerated at a cost of less
than 20% reduced throughput during normal
operation, even for a disk array of moderate size. We
also examine the benefit from load balancing
techniques proposed for traditional storage systems and
find only limited improvement in the measured
throughput.


1 Introduction 


Striping media files across multiple disks has the
advantage of keeping the disks implicitly load-balanced.
With several concurrent sessions of playback
asynchronously started, different parts of each file are
accessed from different disks. As a result it is no 
longer necessary to replicate media files according 
to their popularity, which leads to lower resource 
requirements and system administration cost. The 
disadvantage of disk striping is decreased system re- 
liability because media files are left partially inacces- 
sible when one or more disks fail. This causes service 
disruption to all the users served by the disk array 
at the moment of the failure. In contrast, when an 
entire file is stored on a single disk, only those users 
accessing files on the failed disk are negatively af- 
fected. 


Previous work has addressed the general problem of 
disk array reliability by using data redundancy tech- 
niques to allow recovery of inaccessible data [12]. 
Some of these techniques have been successfully ex- 
tended for handling the case of striped media files as 
well [6,25]. However, all the known analytical and 
experimental work on this subject is either limited
to streams of constant bit rates (CBR), or assumes 
stochastic admission control [27, 29]. 


Variable bit-rate (VBR) encoding of video can
considerably reduce the size of the generated media files
when compared to constant bit-rate encoding of
equivalent perceptual quality [18,22]. In addition,
knowledge about the resource requirements of stored streams
during transmission can be leveraged for better
predicting access delays, and offering deterministic QoS
guarantees [11]. Although striping of VBR streams
has been previously studied, it remains unclear whether
increased reliability can be provided with
deterministic QoS guarantees in cost-effective ways.


Variability in the resource requirements over time
makes efficient disk space allocation combined with
access delay predictability a challenging task. Data
striping across multiple disks with sufficient
redundancy to tolerate failures further aggravates the
problem due to the need for keeping balanced the storage
space and bandwidth requirements across the disk
array, under both normal-operation and failed-disk
conditions. In the present paper, we describe a
number of data redundancy and bandwidth reservation
schemes that can tolerate single disk failures
without service interruption. We experimentally
evaluate the cost of increased reliability in terms of
reduced throughput during the normal operation of
the system using a video server prototype
implementation and MPEG-2 streams. We also investigate the
achieved disk bandwidth utilization during normal
and failed-disk operation. Additionally, we examine
the extra benefit from retrieving data replicas stored
on the least loaded disks, and from fragmenting data
replicas across multiple disks.


Figure 1: Compressed video streams are stored across
multiple disks of the media server. Multiple clients (or
proxies) can connect and start playback sessions via
separate network links.


The rest of this paper is organized as follows. In 
Section 2, we describe basic assumptions and archi- 
tectural decisions of our system. In Sections 3, 4 
and 5, we introduce alternative policies for repli- 
cating stream data and reserving disk bandwidth 
for improved reliability. In Section 6, we briefly 
present our prototype implementation and the ex- 
perimentation environment that we use. In Section 
7, we compare the performance of different replication
techniques under alternative bandwidth reservation
schemes and load balancing enhancements.
In Section 8, we discuss possible improvements and 
extensions, and in Section 9 we summarize our con- 
clusions. 


2 System Architecture 


In the present section, we describe the system archi- 
tecture along with important resource management 
and reservation techniques that we use [2].


2.1 Overview


The operation of our media server is typical in
current system designs. Client devices submit playback
requests concurrently to the server. The system
is assumed to operate according to the server-push
model. When a playback session starts, the server
periodically sends data to the client until either the
end of the stream is reached, or the client explicitly
requests suspension of the playback. Data transfers
occur in rounds of fixed duration $T_{round}$. In each
round, an appropriate amount of data is retrieved
from the disks into a set of server buffers reserved
for each active client. Concurrently, data are sent
from the server buffers to the client through the
network interfaces (Figure 1). The amount of stream
data periodically sent to the client is determined by
the decoding frame rate of the stream, the buffering
constraints of the receiver, and the resource
management policy of the network. As a minimum
requirement, the client should receive in each round the
amount of data that will be needed by the decoder
during the next round.


The streams are compressed according to any
encoding scheme that supports constant quality
quantization parameters and variable bit rates. Playback
requests arriving from the clients are initially directed
to an admission control module, where it is
determined whether enough resources exist to activate
the requested playback session either immediately
or within a limited number of rounds. A schedule
database maintains for each stream information on
how much data needs to be accessed from each disk
in any given round, the amount of server buffer space
required, and how much data needs to be transferred
to the client. This scheduling information is
generated when the media stream is first stored, and is
used for both admission control and transfer of data
during playback.
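
The admission decision can be sketched for a single resource. The function below is an illustrative simplification, not the prototype's admission module: it assumes one scalar load per round (for example, reserved time on one disk), whereas the server described here tracks disk, buffer, and network requirements together. The names `earliest_start`, `reserved`, `demand`, and `max_delay` are hypothetical.

```python
def earliest_start(reserved, demand, capacity, max_delay):
    """Return the first start round in [0, max_delay] at which `demand` fits.

    reserved: per-round load already committed (index = rounds from now);
              rounds past the end of the list are treated as unloaded.
    demand:   per-round load of the new stream, as read from its schedule.
    capacity: maximum load that one round can carry.
    Returns None when no start round within the allowed delay works.
    """
    for start in range(max_delay + 1):
        fits = True
        for r, need in enumerate(demand):
            idx = start + r
            used = reserved[idx] if idx < len(reserved) else 0
            if used + need > capacity:
                fits = False
                break
        if fits:
            return start
    return None
```

Because the per-round requirements of a stored stream are known in advance, the check can be made for every future round of the session before it is admitted, which is what allows the guarantee to be deterministic rather than statistical.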


2.2 Stride-Based Disk Space Allocation 


In our experiments, we use a method called
stride-based allocation for allocating disk space [3]. In
stride-based allocation, disk space is allocated in
large, fixed-sized chunks called strides. The strides
are chosen larger than the maximum stream request
size per disk during a round. This size is known
a priori, since stored streams are accessed
sequentially according to a predefined (generally variable)
rate. A stride may contain data of more than one
round. When a stream is retrieved, only the
requested amount of data is fetched to memory during
a round, and not the entire stride.





USENIX Association 



Stride-based allocation eliminates external
fragmentation due to the fixed-size strides. Internal
fragmentation remains negligible because of the large
size of the streams relative to strides. Another
advantage of stride-based allocation is that it sets an
upper-bound on the estimated disk access overhead
during retrieval. Since the size of a stream request
never exceeds the stride size during a round, at most
two partial stride accesses will be required to serve
the request of a round on each disk.
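
The two-access bound follows from simple index arithmetic, as the sketch below shows. It is a hypothetical helper, not code from the prototype: it assumes that a stream's data on one disk can be addressed by a byte offset, and that stride n covers offsets [n * stride, (n + 1) * stride).

```python
def strides_touched(offset, size, stride):
    """Return the indices of the strides overlapped by [offset, offset + size)."""
    if size <= 0:
        return []
    first = offset // stride
    last = (offset + size - 1) // stride
    return list(range(first, last + 1))
```

When `size <= stride`, the last byte lies fewer than `stride` bytes after the first, so `last - first` is at most 1 and the request touches at most two strides, whatever its starting offset. Each stride is contiguous on disk, so this is also a bound of two head-positioning operations per stream per round.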


2.3 Reservation of Server Resources 


A mathematical abstraction of the resource
requirements is necessary for scheduling streams. We
consider a system with $D$ functionally equivalent disks.
Data of each stream are stored as sequences of strides
on each disk. Each stride comprises an integer
number of consecutive logical blocks with fixed size $B_l$.
The logical block size is a multiple of the physical
sector size $B_p$ of the disk. Both the disk transfer
requests and the memory buffer reservations are
specified in multiples of the block size $B_l$. The Disk
Striping Sequence $S_d$ of length $L_d$ determines the
amount of data $S_d(i,k)$, $0 \le i \le L_d - 1$, that is
retrieved from disk $k$, $0 \le k \le D - 1$, in round $i$.


We assume that each disk has edge to edge seek time
$T_{fullSeek}$, single-track seek time $T_{trackSeek}$, average
rotational latency $T_{avgRot}$, and minimum internal
transfer rate $R_{disk}$. The stride-based disk space
allocation policy enforces an upper bound of at most
two disk arm movements per disk for each client per
round. The total seek distance can also be limited
using a Circular SCAN disk scheduling policy. Let
$M_i$ be the number of active streams during round $i$ of
the system operation, and $l_j$ the round of system
operation in which the playback of stream $j$, $1 \le j \le M_i$,
started. Then, the total access time on disk $k$ in
round $i$ of the system operation can be approximated
by the following expression:


\[
T_{disk}(i,k) = 2\,T_{fullSeek} + 2\,M_i \cdot (T_{trackSeek} + T_{avgRot})
              + \sum_{j=1}^{M_i} S_d^j(i - l_j, k) / R_{disk}
\]


where $S_d^j$ is the disk striping sequence of client $j$.
The parameter $T_{fullSeek}$ is counted twice due to the
disk arm movement from the C-SCAN policy, while
the factor two in the second term is due to the
stride-based allocation scheme we use. The first
term should be accounted for only once in the time
reservation of each disk, but each client $j$ incurs an
extra access time of


\[
T_{disk}^j(i,k) = 2 \cdot (T_{trackSeek} + T_{avgRot}) + S_d^j(i - l_j, k) / R_{disk}
\]


on disk $k$ during round $i$, when $S_d^j(i - l_j, k) > 0$,
and zero otherwise. Reservations of network bandwidth
and buffer space are more straightforward,
and based on the network and buffer sequence of
each accepted playback request, respectively.
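
As a rough illustration of the access time estimate, the sketch below charges the full-stroke seek twice per disk and round, and then adds, for every stream that actually reads from the disk in that round, two track seeks plus rotational latencies and the transfer time. The function name is hypothetical. The seek and transfer figures are the read values listed for the Seagate Cheetah in Table 2; the 3 ms rotational latency is an assumed placeholder, since that value is not given in the text.

```python
def disk_access_time(requests_bytes, t_full_seek, t_track_seek, t_avg_rot, r_disk):
    """Estimated access time (seconds) on one disk during one round.

    requests_bytes holds, for each active stream, the number of bytes it
    retrieves from this disk in this round (0 if it does not use the disk).
    """
    total = 2 * t_full_seek  # C-SCAN arm movement, counted once per disk
    for size in requests_bytes:
        if size > 0:
            # Per-stream cost: at most two partial stride accesses,
            # plus the transfer at the minimum internal rate.
            total += 2 * (t_track_seek + t_avg_rot) + size / r_disk
    return total


estimate = disk_access_time(
    requests_bytes=[565_000, 0, 1_130_000],
    t_full_seek=0.0182,    # maximum seek (read), seconds
    t_track_seek=0.00098,  # track to track seek (read), seconds
    t_avg_rot=0.003,       # assumed value, not taken from the text
    r_disk=11.3e6,         # minimum sustained transfer rate, bytes/s
)
```

With these inputs the estimate is roughly 0.2 seconds, so a one-second round would still have room for additional streams on this disk.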


2.4 Variable-Grain Striping 


For striping stream data across multiple disks, we
use the Variable-Grain Striping policy. Data consumed
by a client during a playback round is stored
on (and accessed from) a single disk, while different
disks are visited in round-robin fashion during
successive rounds of a stream playback. When compared
against alternative striping techniques, variable-grain
striping demonstrates significant performance
advantage due to i) reduced disk access overhead
from accessing at most one disk per stream in a
round, and ii) improved disk bandwidth utilization
by statistically multiplexing I/O requests of different
sizes from concurrently served streams [2].
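
The placement rule can be sketched as follows. The function names and the `start_disk` parameter are illustrative assumptions; rounding of each request to a multiple of the logical block size is omitted for brevity.

```python
def disk_for_round(start_disk, playback_round, num_disks):
    """Disk that holds the data consumed in a given playback round."""
    return (start_disk + playback_round) % num_disks


def striping_sequence(bytes_per_round, start_disk, num_disks):
    """Build a per-round table with one entry per disk.

    Row i holds the bytes retrieved from each disk in playback round i.
    Exactly one entry per row is nonzero, because all the data of a
    round lives on a single disk.
    """
    table = []
    for i, nbytes in enumerate(bytes_per_round):
        row = [0] * num_disks
        row[disk_for_round(start_disk, i, num_disks)] = nbytes
        table.append(row)
    return table


# A variable bit-rate stream needing 300, 500 and 200 bytes in three
# successive rounds, stored on three disks starting at disk 1.
table = striping_sequence([300, 500, 200], start_disk=1, num_disks=3)
```

Each row touches one disk only, while the per-round amounts differ, which is the "variable grain" of the policy.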


3 Data Redundancy Policies


Due to the large number of components involved,
it is necessary to assume device failures during the
lifetime of a typical commercial server installation.
With the estimated Mean Time To Failure of a modern
disk at about $MTTF_{disk} = 1,200,000$ hours,
combining $D = 1024$ disks results in $MTTF_{array} =
MTTF_{disk} / D \approx 49$ days, assuming failure independence
among different devices [26].¹ A typical system with
1024 drives could support about 51,000 concurrent
playbacks of 5 Mbit/s average bit rate each.² Although
the disks are likely to be distributed across
multiple independent servers, building a single large
disk array from distributed components has also been
demonstrated in the past for media streaming services
[9, 19].
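
The arithmetic behind these two estimates is easy to reproduce. The sketch below assumes independent failures, so that the array failure rate is the sum of the disk failure rates, and uses the 31 MB/s per-disk transfer rate mentioned in the footnote (with 1 MB taken as 10^6 bytes, which is an assumption).

```python
mttf_disk_hours = 1_200_000
num_disks = 1024

# With independent failures, the array fails D times as often as one disk.
mttf_array_hours = mttf_disk_hours / num_disks
mttf_array_days = mttf_array_hours / 24

# Aggregate bandwidth divided by the average stream bit rate.
disk_rate_bits = 31e6 * 8   # 31 MB/s per disk, in bits per second
stream_rate_bits = 5e6      # 5 Mbit/s average per playback
playbacks = num_disks * disk_rate_bits / stream_rate_bits
```

This gives about 1,172 hours (just under 49 days) between failures, and roughly 50,800 concurrent playbacks, consistent with the figures quoted above.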


In order to provide higher system reliability, data
redundancy techniques can be used. However, a practical
solution should minimize the extra computation,
storage and bandwidth requirements with respect
to the non-redundant case. The present study
focuses on single disk failures, which are the most
common. Multiple disk failures are less likely to occur
simultaneously [12], and possible ways for handling
them are described briefly later.


¹ The calculation assumes Seagate Cheetah 18GB Ultra160
SCSI disks with 31 MB/s formatted minimum internal transfer
rate [28].

² These numbers are realistic in light of the popularity of
similar services. For example, the number of cable television
subscribers in 1998 was estimated to exceed 65 million in the
US alone [13].


Figure 2: Deterministic Replica Placement. Data
of a media file stored consecutively on disk 0 and retrieved
during different playback rounds are replicated
round-robin across the other disks. The primary data of
the other disks are replicated in a similar way.


In the past, several parity-based techniques have
been proposed that store error-correcting code for
the data blocks of different disks [12]. When a disk
fails, redundant information available on the surviving
disks is used to recover the missing data blocks.
Parity-based techniques trade extra disk bandwidth
or memory buffer requirements for reduced storage
space. Since disk storage space currently has the
lowest cost of the three resources, it has been suggested
that replication rather than parity is the preferred
technique for tolerating disk failures [10, 17].
Furthermore, implementation of parity-based data
recovery in a distributed architecture requires additional
data traffic among different nodes [9]. This
can introduce significant extra complexity and resource
requirements in terms of network bandwidth
and buffer space. For the above reasons, we do not
consider parity-based techniques any further here.


With mirroring techniques, the data of each disk are
replicated on one or more different disks. We refer
to the original copy of the data as primary and the
additional copy as backup. Although the two copies
can be used symmetrically, distinct placement policies
can be applied to each of them as we describe
shortly. When one disk fails, its data remain available
by retrieving their backup replicas from the rest
of the disks. The required storage space is roughly
doubled and the needed bandwidth from each disk is
at most twice that of the non-redundant case. The
backup replica of each data block can be stored in
its entirety on a different disk, which requires only
one access in the case of failure and minimizes the
access overhead. The alternative of declustering a
backup replica across multiple disks can potentially
better balance the extra access load, but incurs the
additional overhead of multiple accesses in the case
of a disk failure.


Figure 3: Random Replica Placement. Data of a
media file stored consecutively on disk 0 and retrieved
during different playback rounds are replicated on randomly
chosen disks 1 to 3. The primary data copies of
disks 1 to 3 are replicated in a similar way.


3.1 Deterministic Replica Placement 


In previous work, we have demonstrated that variable-grain
striping of media files leads to equally utilized
disks under sequential playback workloads [2]. Although
mirroring has previously been used only with
data striped using fixed-size blocks, in principle it
could be applied to variable-grain striping as well.
During sequential playback of a media file with no
failed disks, each disk is accessed every $D$ rounds,
where $D$ is the total number of disks. In order to preserve
the load-balancing property when a disk fails,
data of a media file stored consecutively on each disk
could be replicated round-robin across the remaining
disks (or a subset of them). The unit of replication
corresponds to data retrieved by a client during one
round of playback. We call this mirroring approach
Deterministic Replica Placement. In Figure 2, for
example, disk 0 is shown to store stream data requested
during rounds $k \cdot D$, $(k+1) \cdot D$, $(k+2) \cdot D$ and
$(k+3) \cdot D$. The respective replicas are distributed
round-robin among disks 1, 2 and 3.
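
One possible way to express this mapping is sketched below. The exact formula is an assumption made for illustration (the text only states that replicas rotate round-robin over the other disks), and the function names and the `start` parameter are hypothetical.

```python
def primary_disk(playback_round, num_disks, start=0):
    """Disk holding the primary copy of the data for a playback round."""
    return (start + playback_round) % num_disks


def backup_disk_deterministic(playback_round, num_disks, start=0):
    """Disk holding the backup copy, rotating over the other D-1 disks.

    The m-th piece of the file stored on a given primary disk is placed
    on the disk that lies 1 + (m mod (D-1)) positions after it, so the
    offset is never zero and the two copies never share a disk.
    """
    k = primary_disk(playback_round, num_disks, start)
    m = playback_round // num_disks  # index of this piece on disk k
    return (k + 1 + m % (num_disks - 1)) % num_disks


# With four disks, the pieces stored on disk 0 (rounds 0, 4, 8, 12)
# get their backups on disks 1, 2, 3 and then 1 again.
backups_of_disk0 = [backup_disk_deterministic(r, 4) for r in (0, 4, 8, 12)]
```

When disk 0 fails under this mapping, its load in successive visits is spread evenly over disks 1 to 3, which is the property the policy aims to preserve.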


3.2 Random Replica Placement 


Intuitively, having replicas of one disk’s primary data
distributed round-robin across the rest of the disks
can keep the surviving disks equally utilized when
one disk fails. An alternative replication approach
would use some pseudo-random sequence for specifying
the disks that store the backup copies of one
disk’s primary data. An obvious constraint is that
primary and backup copies are stored on different
devices. The unit of replication corresponds to the
data of a media file requested by a client in one round
of playback. We call this mirroring technique Random
Replica Placement.


Figure 4: Mirroring Reservation. For each disk,
there is a separate vector indexed by round number that
accumulates the total estimated access time for retrieving
primary and backup data in each round.


An example is shown in Figure 3, where backup
copies of data requested in rounds $k \cdot D, \ldots, (k+3) \cdot D$
are randomly placed on disks 1 to 3. It has
been previously suggested that random placement of
primary and backup replicas across different disks
is applicable to a wider range of workload types
and can outperform striping policies with round-robin
placement [27]. In a later section, we examine
this argument in the particular case of variable bit-rate
streams by comparing it against deterministic
replica placement.
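
A minimal sketch of the random policy follows. It assumes a uniform choice among the disks other than the primary, driven by a seeded generator so that the placement is reproducible; the text does not specify the pseudo-random sequence, so both choices are assumptions.

```python
import random


def backup_disk_random(primary, num_disks, rng):
    """Pick a backup disk uniformly among all disks except the primary."""
    candidates = [d for d in range(num_disks) if d != primary]
    return rng.choice(candidates)


rng = random.Random(2002)  # fixed seed: the mapping must be repeatable
placements = [backup_disk_random(0, 4, rng) for _ in range(12)]

# Count how many replicas of disk 0 landed on each of the other disks.
counts = {d: placements.count(d) for d in (1, 2, 3)}
```

With only a few replicas per disk, the counts are usually uneven, which hints at why random placement balances the failure load less evenly on small arrays.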


4 Disk Bandwidth Reservation 


Our goal in this section is to allocate resources in
such a manner that service to accepted requests will
not be interrupted during (single) disk failures. Retrieving
backup replicas of data stored on a failed
disk requires extra bandwidth to be reserved in advance
across the surviving disks. This implies that
the system will normally have to operate below full
capacity. Alternatively, when a disk fails and no
extra bandwidth has been reserved, service will become
unavailable for a number of active users with
aggregate bandwidth requirements no less than the
transfer capacity of one disk, assuming that data
have been replicated as described previously.


The net benefit from uninterrupted service during
disk failures is equal to the difference between two
measures. One is the cost of having users frustrated
due to interrupted service from a failed disk. Its
quantification would require determining the minimum
number of users negatively affected when a
disk fails. Detailed study of this issue is left for future
work. The other measure is the cost of rejecting
user requests due to additionally reserved disk bandwidth
during normal operation. In the rest of this
paper, we describe alternative approaches for reserving
disk bandwidth (or equivalently access time) and
improving reliability in media servers that support
variable bit-rate streams. Subsequently, we experimentally
evaluate the actual cost of these approaches
in terms of reduced system throughput during normal
system operation.


Figure 5: Minimum Reservation. For each disk, we
maintain $D$ separate vectors indexed by round number.
One of them accumulates access delays for retrieving primary
data. The remaining $D-1$ vectors accumulate
access delays for retrieving backup replicas that correspond
to primary data stored on each of the other $D-1$
disks. In each round, the sum of the primary data access
time and the maximum of the backup data access times
is reserved on each disk.


In what we call Mirroring Reservation, disk bandwidth
is reserved for both the primary and backup
replicas of a media file during its playback (Figure 4).
At first glance, this seems to be a reasonable
approach for guaranteeing timely access to
backup replicas during a single disk failure. However,
when compared to the non-redundant case, it
doubles the bandwidth requirements of each stream
and halves the maximum system throughput, assuming
disk bandwidth is the bottleneck resource in
the system. Indeed, we would prefer that the load
normally handled by a failed disk is equally divided
among the $D-1$ surviving disks. Thus, tolerating
one disk failure should require that the extra
bandwidth reserved on each disk be equal to $1/D$ of
its bandwidth capacity. Instead, mirroring reservation
reserves extra bandwidth on each disk equal to
half its bandwidth capacity.


Table 1: The replica placement policies can be orthogonally
combined with the disk bandwidth reservation schemes.


Essentially, it is wasteful to reserve on a disk extra 
bandwidth for accessing backup replicas of primary 
data stored on more than one other disk. When 
a disk fails, we only need an estimate of the addi- 
tional access load incurred on every surviving disk. 
In order to know that, the access time of the backup 
replicas stored on one disk can be accumulated sep- 
arately for every disk that stores the corresponding 
primary data. Then, the additional access time that 
has to be reserved on a disk in each round is equal 
to the maximum time required for retrieving backup 
replicas for another disk that has failed. The max- 
imum across every other disk is reserved, since we 
don’t know in advance which other disk is going to 
fail. 


For each disk, our implementation maintains D vec- 
tors, indexed by round number of system operation. 
One of the vectors keeps track of the total access
time required for retrieving primary data. The re- 
maining D—1 vectors keep track of access delays due 
to backup data corresponding to primary data of the 
remaining disks. For every disk, we reserve the sum 
of the primary data access time and the maximum of 
the backup data access times required in each round. 
We refer to this more efficient scheme as Minimum 
Reservation. An example with four disks is illus- 
trated in Figure 5. In a later section, we discuss 
ways for limiting the additional computational and 
memory requirements of this approach. 
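
The difference between the two reservation schemes can be sketched for a single round as follows. The data layout (`backup[f][k]` holding the time needed on disk `k` for backups whose primary copy lives on disk `f`) and the function names are assumptions made for illustration; times are in milliseconds.

```python
def mirroring_reservation(primary, backup):
    """Reserve time for primary data plus ALL backup replicas on a disk."""
    d = len(primary)
    return [
        primary[k] + sum(backup[f][k] for f in range(d) if f != k)
        for k in range(d)
    ]


def minimum_reservation(primary, backup):
    """Reserve time for primary data plus the worst single-failure case.

    Only one disk is assumed to fail, so each surviving disk needs the
    maximum (not the sum) of the per-failed-disk backup access times.
    """
    d = len(primary)
    return [
        primary[k] + max(backup[f][k] for f in range(d) if f != k)
        for k in range(d)
    ]


# Four disks, 600 ms of primary accesses each; the backups of every
# disk cost 200 ms on each of the three other disks.
primary = [600, 600, 600, 600]
backup = [[0 if f == k else 200 for k in range(4)] for f in range(4)]

mirrored = mirroring_reservation(primary, backup)
minimal = minimum_reservation(primary, backup)
```

In this example mirroring reservation asks for 1200 ms per disk, which does not fit in a one-second round, whereas minimum reservation asks for 800 ms.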


The two disk bandwidth reservation schemes that we
just described can be orthogonally combined with
the two replica placement policies that we introduced
previously, as shown in Table 1.


5 Load Balancing Enhancements 


The load of a failed disk could possibly be shared
more fairly among the surviving disks if each backup
replica was declustered across multiple devices. Therefore,
we break each backup replica into blocks of
fixed size $B_d$, and we call this load balancing technique
Backup Replica Declustering. We choose $B_d$
to be an integer multiple of the logical block size $B_l$,
introduced previously. We allow the last fragment
of the replica to have a size that is smaller than $B_d$
but an integer multiple of $B_l$.


The backup replica blocks corresponding to the primary
data of each disk are distributed either round-robin
or pseudo-randomly across the rest of the disks,
depending on whether deterministic or random replica
placement is used. In the case of random replica
placement, we improve block distribution by avoiding
reusing the same disk for storing multiple replica
blocks of the same file in one round unless we are
running out of disks. When multiple blocks of size
$B_d$ are retrieved from a disk during one round of a
file playback, the minimum required number of read
requests is submitted to the disk, instead of one per
block.
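
The splitting and the round-robin variant of the distribution can be sketched as below. The function names are hypothetical, the 256 KB declustering block is an arbitrary example value, and the pseudo-random variant with disk reuse avoidance is not shown.

```python
KB = 1024


def decluster(replica_bytes, block_size, logical_block):
    """Split a backup replica into fixed-size blocks.

    All blocks have size block_size, except possibly the last fragment,
    which is rounded up to a multiple of the logical block size.
    """
    if block_size % logical_block != 0:
        raise ValueError("block_size must be a multiple of logical_block")
    blocks = [block_size] * (replica_bytes // block_size)
    rest = replica_bytes % block_size
    if rest:
        blocks.append(-(-rest // logical_block) * logical_block)
    return blocks


def place_round_robin(blocks, primary, num_disks):
    """Assign successive blocks to the disks other than the primary."""
    others = [d for d in range(num_disks) if d != primary]
    return [(others[n % len(others)], size) for n, size in enumerate(blocks)]


# A 600 KB replica, 256 KB declustering blocks, 16 KB logical blocks.
blocks = decluster(600 * KB, 256 * KB, 16 * KB)
layout = place_round_robin(blocks, primary=0, num_disks=4)
```

Here the replica becomes two 256 KB blocks and one 96 KB fragment, placed on disks 1, 2 and 3, so a failure of disk 0 costs three smaller accesses instead of one large one.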


Alternatively, during normal operation we could take
advantage of multiple available data replicas by dynamically
deciding to retrieve the replica stored on
the disk expected to be the least loaded. The disk
choice could be based on access time estimations
available through resource reservations that are made
during admission control. We use the term Dynamic
Balancing for this technique. It can be fully applied
when all the disks are functional and is expected to
reduce the load of the most heavily utilized disks in
each round.
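
A minimal sketch of this choice for one round of one stream is given below, assuming a per-disk table of reserved access time kept by admission control. The names and the tie-breaking rule (prefer the primary copy) are illustrative assumptions; times are in milliseconds.

```python
def choose_replica(primary, backup, reserved):
    """Return the disk (primary or backup) with less reserved time."""
    return primary if reserved[primary] <= reserved[backup] else backup


def reserve_access(reserved, primary, backup, access_time):
    """Reserve the access on the less loaded of the two replica disks."""
    disk = choose_replica(primary, backup, reserved)
    reserved[disk] += access_time
    return disk


# Reserved time per disk for the round under consideration.
reserved = [500, 200, 300, 100]

first = reserve_access(reserved, primary=0, backup=3, access_time=50)
second = reserve_access(reserved, primary=1, backup=2, access_time=50)
```

The first access goes to the backup on disk 3, because disk 0 is the most loaded, while the second stays on its primary disk 1.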


Both of these techniques have previously been found
to improve performance when applied to traditional
transaction processing workloads [23]. Replica declustering
has also been tried with constant bit-rate stream
playback [9, 15]. Due to the potential for load imbalance
and reduced device utilization introduced by
variable bit-rate streams, we investigate the benefit
from load-balancing techniques in that context.


6 Experimentation Environment 


In order to keep our presentation complete, we briefly 
describe here important aspects of our prototype im- 
plementation, the characteristics of our benchmarks, 
and the performance evaluation method that we use 
for our experiments [2]. 


6.1 Prototype Overview 


We have designed and built a media server experimentation
platform, in order to evaluate the resource
requirements of alternative disk replication policies
[2]. The different modules are implemented in about
17,000 lines of C++/Pthreads code on AIX 4.1. The
code is linked either to the University of Michigan
DiskSim disk simulation package [16], which incorporates
advanced features of modern disks such as
on-disk cache and zones for simulated disk access
time measurements, or to hardware disks through
their raw device interfaces. The indexing metadata
are stored as regular Unix files, and during operation
are kept in main memory.


Seagate Cheetah ST-34501N

Data Surfaces
Buffer Size                               512 KB
Track to Track Seek (read/write)          0.98/1.24 msec
Maximum Seek (read/write)                 18.2/19.2 msec
Average Rotational Latency
Internal Transfer Rate
  Inner Zone to Outer Zone Burst          122 to 177 Mbit/s
  Inner Zone to Outer Zone Sustained      11.3 to 16.8 MB/s

Table 2: Features of the Seagate SCSI disk assumed in our
experiments.


The basic responsibilities of the media server include 
file naming, resource reservation, admission control, 
logical to physical metadata mapping, buffer man- 
agement, and disk and network transfer scheduling. 


With appropriate configuration parameters, the system
can operate at different levels of detail. In Admission
Control mode, the system receives playback
requests, does admission control and resource reservation,
but no actual data transfers take place. In
Simulated Disk mode, most modules become functional
and disk request processing is simulated using
the specified DiskSim disk array. Techniques for file
system simulation similar to those previously proposed
are used for integrating the simulated disks
with our media server prototype [31]. There is also
the Full Operation mode, where the system accesses
hardware disks and transfers data to fixed client network
addresses. For the experiments in the current
study, we used both the Admission Control and the
Simulated Disk mode.


6.2 Performance Evaluation Method 


Table 3: We used six MPEG-2 video streams of 30 minutes
duration each. The coefficient of variation shown in the last
column changes according to the content type.
Columns: Content, Avg Bytes, Max Bytes, CoV. Surviving entries: 1201221; 0.366; 0.245.


We assume that playback initiation requests arrive
independently of one another, according to a Poisson
process. The system load can be controlled through
the arrival rate $\lambda$ of playback initiation requests. Assuming
that the disk transfers are the bottleneck, we
consider a “perfectly efficient system” that incurs no
disk overhead when accessing data. Then, we choose
the maximum arrival rate $\lambda = \lambda_{max}$ of playback requests
equal to the mean stream completion rate in
that perfectly efficient system. This creates enough
system load to show the performance benefit of arbitrarily
efficient data striping policies. The mean
stream completion rate $\mu$, expressed in streams per
round, for streams of average data size $S_{tot}$ bytes
becomes:


\[
\mu = \frac{D \cdot R_{disk} \cdot T_{round}}{S_{tot}} \quad \text{streams/round} \qquad (1)
\]


The corresponding system load becomes $\rho = \lambda / \mu \le 1$,
where $\lambda \le \lambda_{max} = \mu$.
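
To make the load definition concrete, the sketch below computes the completion rate of the perfectly efficient system and the arrival rate for a given load. The 16 disks, the 11.3 MB/s transfer rate and the one-second round come from the experimental setup; the 1.13 GB average stream size is an assumed round number chosen for illustration.

```python
def completion_rate(num_disks, r_disk, t_round, s_avg):
    """Mean stream completion rate in streams per round.

    Total bytes the disks can transfer in one round, divided by the
    average number of bytes in one stream.
    """
    return num_disks * r_disk * t_round / s_avg


mu = completion_rate(num_disks=16, r_disk=11.3e6, t_round=1.0, s_avg=1.13e9)

# Arrival rate that corresponds to a target system load of 80%.
target_load = 0.8
arrival_rate = target_load * mu
load = arrival_rate / mu
```

With these numbers the ideal system completes 0.16 streams per round, so an 80% load corresponds to about 0.128 playback requests arriving per round.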


For each playback request that arrives, the admission
control module checks whether available resources
exist for every round during playback. The test
considers the data transfer requirements of the requested
playback for every round and also the corresponding
available disk transfer time, network transfer
time and buffer space in the system. If the request
cannot be initiated in the next round, the test
is repeated for each round up to $\lceil 1/\lambda \rceil$ rounds into
the future, until the first round is found where the
requested playback can be started with guaranteed
sufficiency of resources. Checking $\lceil 1/\lambda \rceil$ rounds into
the future achieves most of the potential system capacity
as was shown previously [2]. If not accepted,
the request is rejected rather than being kept in a
queue.
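
The admission test with lookahead can be sketched for a single resource as follows. This is a simplification: the actual test also covers network transfer time and buffer space, and tracks each disk separately. The function names, the list-based bookkeeping and the explicit `lookahead` parameter are assumptions made for illustration.

```python
def first_feasible_start(demand, free, next_round, lookahead):
    """Find the earliest start round with enough resources throughout.

    demand[r] is the amount needed in playback round r; free[t] is the
    amount still available in system round t. Start rounds from
    next_round up to next_round + lookahead are tried in order.
    Returns the start round, or None if the request must be rejected.
    """
    for start in range(next_round, next_round + lookahead + 1):
        if start + len(demand) > len(free):
            break  # playback would run past the bookkeeping horizon
        if all(need <= free[start + r] for r, need in enumerate(demand)):
            return start
    return None


def reserve(demand, free, start):
    """Subtract the demand of an admitted playback from the free amounts."""
    for r, need in enumerate(demand):
        free[start + r] -= need


free = [10, 2, 10, 10, 10, 10]
demand = [5, 5]

start = first_feasible_start(demand, free, next_round=0, lookahead=3)
rejected = first_feasible_start(demand, free, next_round=0, lookahead=1)
if start is not None:
    reserve(demand, free, start)
```

Round 1 is too loaded for the request to start in rounds 0 or 1, so with a lookahead of three rounds it starts in round 2, while with a lookahead of one round it is rejected.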


6.3 Experimentation Setup 


We used six different VBR MPEG-2 streams of 30
minutes duration each. Every stream has 54,000
frames with a resolution of 720x480 and 24 bit color
depth, 30 frames per second frequency, and an
$IB^2PB^2PB^2PB^2PB^2$ 15 frame Group of Pictures structure.
The encoding hardware that we use generates bit
rates between 1 Mbit/s and 9.6 Mbit/s. The statistical
characteristics of the clips are given in Table
3. The coefficients of variation of bytes per round
lie between 0.028 and 0.383, depending on the content
type. In the mixed benchmark, the six different
streams are submitted round-robin. Where appropriate,
experimental results from individual stream
types are also shown.


The disks assumed in our experiments are Seagate
Cheetah with ultra-wide SCSI interface and the features
shown in Table 2. Such disks were state of
the art about three years ago, and have all the basic
architectural characteristics of today’s high-end
drives. The logical block size $B_l$ was set to 16 KB,
while the physical sector size $B_p$ was equal
to 512 bytes. The stride size $B_s$ in the disk space
allocation was set to 2 MB. The server memory is
organized in buffers of fixed size $B_l = 16$ KB
each, with space of 64 MB for every extra disk. The
available network bandwidth was assumed to be infinite,
leaving contention for the network outside the
scope of the current work.


Figure 6: The mirroring reservation scheme reduces the
number of streams supported by a factor of two compared
to the no-replication case. Deterministic replica placement
sustains an advantage of 25% or more relative to random
replica placement under the mixed stream workload.


Figure 7: With minimum reservation and number of disks
increasing from 4 to 32, the throughput advantage of deterministic
over random replica placement drops from 15% to
3%. The corresponding throughput disadvantage of deterministic
placement with respect to no replication drops from
28% to 17%.


In our experiments, the round time was set equal to
one second. We found this round length to achieve
most of the system capacity with tolerable initiation
latency. This choice also facilitates comparison with
previous work in which one second rounds were used.
We used a warmup period of 3,000 rounds and calculated
the average number of active streams from
round 3,000 to round 9,000. The measurements were
repeated until the half-length of the 95% confidence
interval was within 5% of the estimated mean value
of the number of active streams. The system load
was fixed at $\rho = 80\%$, which allows the system to
reach its capacity while keeping the playback startup
latency limited [2].
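
The stopping rule for repeated measurements can be sketched as follows. The text does not say how the confidence interval was computed; the sketch assumes a normal approximation with the 1.96 quantile, whereas a Student-t quantile would be more appropriate for a handful of runs. The function names are hypothetical.

```python
import math
import statistics


def ci_half_length(samples, z=1.96):
    """Half-length of an approximate 95% confidence interval of the mean."""
    return z * statistics.stdev(samples) / math.sqrt(len(samples))


def precise_enough(samples, rel=0.05):
    """True when the half-length is within rel of the estimated mean."""
    if len(samples) < 2:
        return False  # a single run gives no variance estimate
    return ci_half_length(samples) <= rel * statistics.mean(samples)


# Average number of active streams measured in independent runs.
stable_runs = [100, 102, 98, 101, 99]
noisy_runs = [50, 150]

stop_stable = precise_enough(stable_runs)
stop_noisy = precise_enough(noisy_runs)
```

The five stable runs already satisfy the 5% criterion, whereas the two noisy runs call for further repetitions.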


7 Experimental Evaluation 


We compare the data replication and bandwidth reservation
techniques that we introduced with respect to
the average number of active playback sessions that
can be supported by the server. The objective is to
make this number as high as possible. We provide
supplementary performance intuition with statistics
on reserved and utilized disk access time across different
stream types and numbers of disks.


We start with a performance comparison between
the deterministic and random replica placement policies
under the mirroring reservation scheme. Subsequently,
we examine the improvement to the two
placement policies when minimum reservation is applied.
We also investigate the benefit of dynamic
balancing assuming that disk bandwidth in each round
is reserved for only one data replica out of the two
available. Finally, we consider declustering the backup
replica of each stream across multiple disks and allocating
bandwidth according to the minimum reservation
scheme.


7.1 Replica Placement Comparison 


We use the mixed stream workload to compare the
performance of alternative replica placement policies
under the mirroring reservation scheme (Figure 6).
With the number of disks varying between 4 and 32,
the measured throughput of replicated disk striping
is less than half of what is achieved with no replication.
In addition, deterministic replica placement
achieves a throughput advantage of 25% or more relative
to random replica placement.


General Track: 2002 USENIX Annual Technical Conference USENIX Association


[Figure 8 plot: Reserved Disk Time (Round %) versus Number of Disks; panels: Deterministic, Random; legend: Primary, Backup Minimum Reservation, No Replication.]
[Figure 9 plot legend: Reserved/Backup, Measured/Normal.]


Figure 8: The disk time reserved in each round when individual stream
types are used with minimum reservation. With deterministic replica placement,
the disk time reserved for backup accesses drops from about 24% to
14% of the round length as the number of disks increases from 4 to 16. With
random replica placement, the respective percentage drops from about 22%
to 14%.


From measurements that we did (not shown here),
we found that about half of the average disk time
reserved in the replicated case is wasted for the possibility
that the backup data will be retrieved. Furthermore,
the access time reserved on each disk by
random replica placement is about 15%-25% less
than that of deterministic placement. Pseudo-random
choice of the disk that stores a backup replica
does not completely eliminate the possibility of one
disk storing more replicas than another, especially
with small disk arrays. The probability of that occurring
drops as the size of the disk array increases,
though. However, deterministic placement is more
consistent in fairly distributing the access load across
the disk array devices.


When a disk fails, about 25-30% of the reserved disk
bandwidth remains unused under both placement
policies. This is not surprising, since the mirroring
reservation scheme allocates disk bandwidth for
both the primary and backup replicas of each accepted
stream. We see how this inefficiency is alleviated
by the minimum reservation scheme in the
following subsection.


7.2 Minimizing Reserved Bandwidth 


The minimum reservation scheme improves disk utilization
by allocating on each disk the extra time required
for accessing backup replicas of only one other
disk. In order to ensure that any single disk failure
can be handled properly, each disk keeps track of and
reserves the maximum additional time required for
handling potential failure of any other disk. This
maximum requirement is calculated separately and
is generally expected to be different for each disk in
each round.


Figure 9: During normal operation, the
average disk access time that is measured
in each round remains within 6-8% below
the time reserved for primary data accesses.
When a disk fails, the measured access time
is 13-14% lower than the total reserved.


Figure 7 compares the throughput of the different
replica placement policies under minimum reservation.
At eight disks, the number of streams supported
by deterministic placement is only 21% lower
than that with no replication. This difference becomes
18% and 17%, respectively, with sixteen and
thirty-two disks. From the way that the disk bandwidth
is allocated in the minimum reservation scheme,
we would expect the total bandwidth that remains
unutilized during normal operation to be equal to
the bandwidth capacity of one disk. Therefore, the
percentage of unused throughput with respect to the
non-replicated case should be decreasing proportionally
with the number of disks in the system. For
example, ideally with 16 disks only $1/16 = 6.25\%$
of the total disk bandwidth should remain unused
during normal operation.


However, in practice, the percentage of the total unused
bandwidth of each disk does not change proportionally
with the number of disks (Figure 8).
This effect can be explained by the MAX() operator
that is applied over the estimated time for accessing
the backup replicas of different disks, in combination
with the relatively large size (more than half a
megabyte on average) of the data retrieved for a
stream in each round. We explore later the potential
improvement from declustering backup replicas
across multiple disks.


Additionally, the difference between deterministic
and random replica placement becomes less significant
than the statistical uncertainty at thirty-two
disks. Not surprisingly, deterministic placement maintains
a clear advantage (of about 15%) for smaller
disk array sizes due to the more regular way of distributing
the backup replica access load across the
different devices. These observations are consistent
with the average access time reserved on each disk
across different stream types and disk array sizes
shown in Figure 8.


In Figure 9, we show the measured disk busy time.
Under normal disk operation, we observe that deterministic
placement keeps the disks busy an amount
of time that is 6% lower than what is reserved for
primary data.⁴ When one disk fails, the remaining
disks are busy 14% less time than the total reserved.
With random replica placement, the corresponding
difference becomes 13% of the round length. This
is a significant improvement in comparison to the
25-30% difference between reserved and measured
time that we reported for mirroring reservation. We
should keep in mind that, with disk array size equal
to four, about one third of each disk’s bandwidth has
to be reserved for the case that one disk fails, and
this fraction drops as the disk array size increases to
sixteen (Figure 8).


It is interesting that, when the reserved backup access
time is put into use due to a disk failure, the
difference between reserved and utilized access time
increases from 6-8% to 13-14%. At first glance this
discrepancy appears as reduced accuracy in access
time estimation. In fact it is due to the MAX()
operator that we apply to the backup access times
corresponding to different disks in each round. This
reserves enough access time to ensure uninterrupted
system operation for any particular failed disk. However,
the reported measured time is taken when
disk 0 is assumed inaccessible (Figure 2). Overall,
we believe that some limited discrepancy between
predicted and measured access time leaves a reasonable
cushion space for stable operation. This makes
the system operation more robust, and guards it
against nondeterministic factors, such as the system
bus contention due to network transfers, not
included in the previous measurements.


⁴ In previous work [3], we reported similar differences between
the average reserved time and the access time measured
when using the hardware disks of Table 2, instead of their
simulated models.


Figure 10: During normal operation, accessing the replica
of the least loaded disk improves the throughput by about 5-10%
with respect to the non-replicated case. The gain tends
to increase as the disk array size increases.


7.3 Improving Load Balancing 


With multiple data replicas available, better load 
balancing can be achieved by choosing the replica 
stored on the least loaded disk during admission con- 
trol. In this case, we leverage data replication for 
improving the system throughput, rather than tol- 
erating disk failures. We use the accumulated disk 
access time estimations in order to choose the least 
loaded disk. Making this choice based on actual 
measurement of the disk access load is not a feasible 
alternative, due to the round-based operation that 
prevents access load propagation from one round to 
the next. 
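A minimal sketch of this selection step, assuming the accumulated estimates are kept as one number per disk for the current round. The function name, the argument layout and the round length of 1.0 are illustrative assumptions, not details taken from the prototype.

```python
def admit_request(estimated_load, replica_disks, cost, round_length=1.0):
    """Pick the replica stored on the least loaded disk for one round.

    estimated_load -- accumulated access-time estimates per disk (updated in place)
    replica_disks  -- disks that hold a copy of the requested data
    cost           -- estimated access time of this request
    Returns the chosen disk, or None if the request does not fit.
    """
    disk = min(replica_disks, key=lambda d: estimated_load[d])
    if estimated_load[disk] + cost > round_length:
        # Even the least loaded replica disk is full, so reject.
        return None
    estimated_load[disk] += cost
    return disk


load = [0.6, 0.3, 0.8]
print(admit_request(load, [0, 1], 0.2))  # replica on disk 1 is chosen
print(admit_request(load, [0, 2], 0.5))  # neither disk 0 nor disk 2 has room
```

Because the decision uses estimates rather than measurements, it can be taken at admission time, before any of the involved rounds has executed.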


From Figure 10, we see that, when this load balancing
scheme is used, both replica placement policies can
support 5-10% more streams than the non-replicated case.
The difference between the two placement policies is
statistically insignificant, however, since the gain from
the dynamic replica access exceeds the improved load
balancing of deterministic placement. Determining during
admission control which disk will be used for each data
access is a reasonable policy for removing hot spots in
the disk array. However, under sequential workloads the
different disks are equally utilized already, and only a
limited additional performance benefit can be accrued
with the above policy.


Figure 11: Each backup replica is divided into blocks of
the specified size and distributed across multiple disks.
There is a minor gain from applying backup replica
declustering to deterministic placement, while the best
throughput achieved with random placement approaches that
of deterministic. The two horizontal lines correspond to
the throughput achieved when no declustering is applied to
the two replica placement policies, respectively.


In Figure 11, we consider the case of declustering backup
replicas across multiple disks using a fixed block size
B_d. This approach is expected to let the failed-disk load
be shared more fairly among the surviving disks. With
small block sizes, better load balancing leads to some
limited throughput improvement. As the block size becomes
larger, load balancing gets successively worse and
throughput decreases, because declustering creates
fragments of increasingly different sizes. We also
anticipate that, with larger block sizes, the number of
disks accessed for each stream drops and the total head
movement overhead becomes lower. On the other hand, our
stride-based allocation scheme ensures that at most two
head movements are required per stream on each disk
regardless of how small the block size is. This limits
the negative effect of access overhead on throughput.
Finally, we observe a threshold behavior around
B_d = 1.2·10^6 bytes. This is the maximum amount of data
retrieved in one round for each stream and originates
from the bit-rate parameters used during encoding
(Table 3). Effectively, beyond this point there is no
declustering.
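The dependence of fragment sizes on the declustering block size can be seen in a short sketch. It assumes plain round-robin placement of blocks over the surviving disks, which is a simplification chosen for illustration and not the stride-based allocation scheme of the paper. The function name and the sizes are hypothetical.

```python
def decluster(size, block_size, disks):
    """Bytes of one backup replica that land on each surviving disk."""
    per_disk = [0] * disks
    offset, block = 0, 0
    while offset < size:
        chunk = min(block_size, size - offset)  # the last block may be short
        per_disk[block % disks] += chunk        # round-robin (assumed policy)
        offset += chunk
        block += 1
    return per_disk


# One round's worth of data for a stream, spread over 5 surviving disks.
print(decluster(1000, 100, 5))   # small blocks: even fragments
print(decluster(1000, 400, 5))   # larger blocks: uneven fragments
print(decluster(1000, 2000, 5))  # block larger than the data: no declustering
```

Once the block size exceeds the amount of data a stream retrieves in one round, the whole replica lands on a single disk, which corresponds to the threshold behavior described above.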


Figure 12: The average difference (expressed as round
length percentage) in the reserved access times between
the most and least loaded disk in each round varies
according to the declustering block size. This difference
can be interpreted as one measure of load imbalance
within each round. From the shape, we see that it has a
significant effect on the achieved throughput shown in
Figure 11.


These observations are also verified by Figure 12, which
shows the average difference between the access times of
the most and least loaded disk in each round. Since the
reserved access time of the most heavily loaded disk is
typically 99% in each round, the plots in Figure 12
essentially indicate the access time requirements of the
least loaded one. It is remarkable how the shape of the
plots is reflected in those tracking the system
throughput in Figure 11. We conclude that even the least
loaded disk is expected to remain more than 80% utilized
under deterministic replica placement with no
declustering (equivalently, with declustering block size
larger than 1.2·10^6).


In summary, declustering is only worthwhile with small
declustering block sizes, and its overall benefit is
found to be limited in the media streaming case (less
than 3% with eight disks). Moreover, the throughput of
random replica placement never exceeds that of
deterministic.


8 Discussion 


We considered data replication and bandwidth allocation
schemes that allow tolerating single disk failures in
disk arrays storing variable bit-rate streams. When using
simple schemes for reserving disk bandwidth, more than
half of the maximum achievable throughput is wasted
during normal (i.e. no failure) operation. Instead, using
the minimum reservation scheme for accommodating a single
disk failure results in a throughput reduction of less
than 20% at disk array sizes of sixteen or larger.


The minimum reservation scheme requires maintaining a
number of vectors equal to the square of the number of
disks. Each vector is accessed in a circular fashion and
has minimum length equal to that of the longest stream
expressed in numbers of rounds. When using large disk
arrays, this might raise concerns regarding the
computational and memory requirements involved. In
practice, the reduction in unused bandwidth diminishes as
the number of disks increases beyond sixteen. Therefore,
it makes sense to apply the data replication within disk
groups of limited size when the disk array size becomes
larger. This keeps the bookkeeping overhead limited and
preserves the scalability of our method when stream data
are striped across large disk arrays.
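The bookkeeping cost can be made concrete with a sketch of one possible layout. The indexing convention (owner disk, serving disk, round slot), the class name and the example sizes are assumptions made for illustration. The text only states that the number of vectors is the square of the number of disks and that each vector is accessed circularly.

```python
class ReservationTable:
    """disks x disks circular vectors of per-round access-time reservations.

    v[owner][disk][slot] is the time reserved on `disk` for data whose
    primary copy lives on `owner`. With owner == disk this is the primary
    load, otherwise it is backup load needed only if `owner` fails.
    """

    def __init__(self, disks, rounds):
        self.disks, self.rounds = disks, rounds
        self.v = [[[0.0] * rounds for _ in range(disks)] for _ in range(disks)]

    def add(self, owner, disk, rnd, seconds):
        self.v[owner][disk][rnd % self.rounds] += seconds  # circular access

    def reserved(self, disk, rnd):
        slot = rnd % self.rounds
        backup = max((self.v[f][disk][slot]
                      for f in range(self.disks) if f != disk), default=0.0)
        return self.v[disk][disk][slot] + backup


def table_entries(disks, rounds, group_size=None):
    """Number of vector entries when replication stays within disk groups."""
    g = group_size or disks
    assert disks % g == 0, "sketch assumes equally sized groups"
    return (disks // g) * g * g * rounds


print(table_entries(64, 1800))                 # one group of 64 disks
print(table_entries(64, 1800, group_size=16))  # four groups of 16 disks
```

Under these assumptions the table grows with the square of the group size rather than of the array size, which is why limiting replication to groups keeps the overhead bounded.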


In previous work, we found that striping data using
fixed-size blocks achieves lower throughput than when
using variable-grain striping [2]. The backup replica
declustering should not be confused with fixed-size block
striping, since the primary data still use variable-grain
striping. This maintains some benefit from multiplexing
requests of different transfer sizes in each round, and
absorbs correlations that otherwise would create maximum
requirements much higher than the average.


Provisioning for VCR functionality is an important issue
that we don't consider extensively in the present paper.
In general, such flexibility would require deallocation
of previously reserved resources when a stream playback
is suspended or stopped earlier than its normal
termination. This can be done in a straightforward way
when accumulating disk access delays separately for
primary and backup data replicas, as was already
described above.


The techniques we presented here could be extended in
straightforward ways for handling multiple disk failures.
That would require storing multiple backup replicas, and
making bandwidth reservations for more than one failed
disk. In servers consisting of multiple nodes, failure of
an entire node can also be handled gracefully, by keeping
each disk of a node in a separate disk group and limiting
the replication within each group. When a node fails,
inaccessible data for each of its disks can be retrieved
using replicas available on other disks of the
corresponding groups
[S_15]. 


9 Related Work 


Most of the previous work on disk array fault-tolerance
has been done in the context of traditional file server
and transaction processing workloads. Bitton and Gray
show that mirrored disks can improve I/O performance in
addition to providing enhanced reliability [8]. Hsiao and
DeWitt describe chained declustering that replicates each
database relation on two consecutive disks, while the
workload is balanced across the system using a static
load balancing algorithm [21]. Merchant and Yu propose
using different stripe sizes for different data replicas
[23]. Thus, system operation can be efficient with both
small transaction requests and ad hoc queries on large
parts of a relation.


In our previous work, we found the throughput measured
with disk striping of variable bit-rate streams to
increase linearly as a function of the number of disks
[1,2]. We also described several design decisions of our
server prototype implementation [3]. The system
throughput is further improved when the disk bandwidth
requirements of individual streams are smoothed across
different playback rounds [4], and high disk bandwidth
utilization is achieved across both homogeneous and
heterogeneous disks. System reliability is a crucial
issue when building infrastructure for commercial
services. Addressing this issue creates a strong case for
storage of variable bit-rate streams, and makes the
results of the present paper an indispensable part of our
previously published work.


The related work from media server research is mostly
focused on fault-tolerance techniques when striping
constant bit-rate streams [5, 6, 32]. Disks are grouped
into clusters, and data blocks from separate disks in
each cluster are combined with a parity block to form
parity groups. The blocks of a parity group are
considered to be retrieved and transmitted in one or
multiple rounds, and the parity blocks are stored on data
disks or dedicated parity disks. For improving overall
efficiency, certain data blocks are not transmitted in a
transition period following a disk failure.


Ozden et al. propose reading ahead the data blocks 
of an entire parity group prior to their transmission 
to the client [25]. When a data block cannot be ac- 
cessed, it can be reconstructed using a parity block 
that is read instead. Alternatively, an entire parity 
group is retrieved each time a block cannot be accessed.
Balanced incomplete block designs are used
for constructing parity groups that keep the load of 
the disk array balanced [20, 25]. The dynamic reser- 
vation scheme that they introduce minimizes the ex- 
tra bandwidth that has to be reserved on a disk for 
reconstructing failed-disk data blocks. 


Gafsi and Biersack compare several performance measures
of alternative data-mirroring and parity-based techniques
for tolerating disk and node failures in distributed
video servers [15]. When entire data blocks of one disk
are replicated on different disks, half of the total
bandwidth of each disk is reserved for handling the disk
failure case. The wasted throughput is critically reduced
with the minimum reservation scheme that we propose here.


Tewari et al. study parity-based redundancy tech- 
niques for tolerating disk and node failures in clus- 
tered servers [30]. By distributing the parity blocks 
of an object on a random permutation of certain
disks they can keep balanced the system load when a 
disk fails. Alternatively, Flynn and Tetzlaff replicate 
data blocks across non-intersecting permutations of 
disk groups [14]. Multiple available data blocks can 
be used for dynamic balancing of disk bandwidth
utilization across different devices. Instead, Birk ex- 
amines selectively accessing parity blocks of video 
streams for better balancing the system load across 
multiple disks [7]. 


For failures in video servers supporting variable bit- 
rate streams, Shenoy and Vin apply lossy data re- 
covery techniques that rely on the inherent redun- 
dancy in video streams rather than error-correcting 
codes. Alternatively, they propose taking advantage 
of the sequential block accesses during playback and 
reconstructing missing data from surrounding available
blocks, at the cost of an initial playback latency,
or temporary disruption when a failure occurs [29].


Bolosky et al. decluster the block replicas of one 
disk across d other disks. In case of disk failure, 
the extra bandwidth required for retrieving the data 
of the failed disk is shared among the d other disks 
[9]. In later work, they also consider providing
fault-tolerant support for multiple streams with different
bit rates [10]. In our experience, declustering does 
not add significant improvement with respect to the 
case of replicating the data blocks of one disk in their 
entirety on different disks. 


Mourad describes the doubly-striped disk mirroring 
technique that distributes replica blocks of one disk 
round-robin across the rest of the disks [24]. The 
system load is equally distributed across the sur- 
viving disks in case of a disk failure. The deter- 
ministic replica placement that we describe extends 
doubly-striped mirroring for handling variable bit- 
rate streams and the reduced device utilization that 
they potentially introduce. 


Santos et al. compare disk striping against data 
replication on randomly chosen disks [27]. Using 
constant bit-rate streams, they conclude that ran- 
dom replication can outperform disk striping with 
no replication. In our comparison using variable bit-rate
streams instead, we found an advantage of deterministic
replication over random replication that diminishes as
the number of disks increases.


10 Conclusions 


We studied issues related to data replication of variable
bit-rate streams striped across multiple disks for
improving system reliability and performance. We
introduced the minimum reservation scheme that minimizes
the wasted throughput required for keeping accepted
playbacks uninterrupted during a disk failure. At
moderate disk array sizes, the throughput is less than
20% lower than what is achieved with no replication.
Deterministic placement of backup data is found to
achieve better performance than random placement across
the different disks, although the advantage becomes
insignificant as the number of disks increases.
Retrieving the data replica of each stream stored on the
least loaded disk adds an improvement of no more than 10%
with respect to the non-replicated case. Finally,
declustering the backup replicas across multiple disks
does not seem to considerably improve the performance
achieved with deterministic replica placement.


Abstract


The Profile Collection Toolkit (PCT) provides a novel
generalized CPU profiling facility. PCT enables
arbitrarily late profiling activation and arbitrarily
early report generation. PCT usually requires no
re-compilation, re-linking, or even re-starting of
programs. Profiling reports gracefully degrade with
available debugging data.


PCT uses its debugger controller, dbctl, to drive a
debugger's control over a process. dbctl has a
configuration language that allows users to specify
context-specific debugger commands. These commands can
sample general program state, such as call stacks and
function parameters.


For systems or situations with poor debugger sup- 
port, PCT provides several other portable and flex- 
ible collection methods. PCT can track most pro- 
gram code, including code in shared libraries and 
late-loaded shared objects. On Linux, PCT can 
seamlessly merge kernel CPU time profiles with 
user-level CPU profiles to create whole system re- 
ports. 


1 Introduction 


Profiling is the art and science of understanding program
performance. There are two main families of profiling
techniques: automatic code instrumentation and
statistical sampling. Code instrumentation approaches use
a high-level language compiler or linker to incorporate
new instructions into object file outputs. These
instructions count how many times various parts of a
program get executed. Some instrumentation systems [12]
count function activations while others [1, 21] count
more fine-grained control flow transitions. Sampling
approaches momentarily suspend programs to sample
execution state, such as the value of the program
counter. How frequently certain locations occur during an
execution estimates the relative fraction of time
incurred by those parts of the program.
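The estimate described in the last sentence amounts to a histogram over sampled locations. A minimal sketch follows. The function name and the sample list are made up for illustration and are not part of PCT.

```python
from collections import Counter


def time_fractions(samples):
    """Estimate the fraction of time per location from sampled locations."""
    counts = Counter(samples)
    total = sum(counts.values())
    # Most frequently sampled locations first.
    return {loc: n / total for loc, n in counts.most_common()}


# Ten hypothetical program-counter samples, already mapped to function names.
samples = ["worker"] * 9 + ["main"]
for location, fraction in time_fractions(samples).items():
    print(f"{100 * fraction:5.1f}% {location}")
```

The estimate becomes more reliable as the number of samples grows, which is why the sampling rate trades overhead against accuracy.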


PCT is a sampling-based profiling system that 
shows a new way to construct effective perfor- 
mance investigation tools. PCT demonstrates that 
the same tools programmers are familiar with for 
answering questions about correctness can be used 
for effective performance analysis. The philosophy 
of PCT is that profiling is a particular type of de- 
bugging and that the same preparations should be 
adequate. The focus of PCT is CPU-time profiling
rather than real-time profiling, though, in principle,
sampling may be applied to either.


PCT is also flexible and easy to use. Enabling profile 
collection rarely requires re-compiling, re-linking, or
even re-starting a program. In its simplest usage, 
adding a one word prefix to the command-line can 
activate collection over entire process subtrees and 
emit a basic analysis report at the end. PCT can 
track CPU time spent in the main program text, 
shared libraries, late-loaded dynamic objects and in 
kernel code on Linux. PCT works with a variety of 
programming languages. 


A novel aspect of PCT is that it allows sam- 
pling semantically rich data such as function call 
stacks, function parameters, local or global vari- 
ables, CPU registers, or other execution context. 
This rich data collection capability is achieved via a 
debugger-controller program, dbctl. Using debuggers
to probe program state allows PCT to sample
a wide variety of values. Statistical patterns in
these values may explain program performance. For 
example, statistically typical values of a function’s 
parameters may explain why a program spends a lot 
of time in that function. 


Additionally, dbctl can drive parallel non-interactive
debugging sessions. As the original process creates
children, dbctl can spawn off new debugger instances,
reliably attaching them to those children. Using a
debugger-controller allows a portable implementation of
process subtree execution tracing tools, such as function
call tracers.


Causally Informative  Maximize ability to explain performance
                      characteristics
Extensible            Sample many kinds of user-defined program state
Late-binding          Defer as long as possible the decision of
                      whether to profile
Early-reporting       Make profile reports available as soon as
                      possible
Non-invasive          Require no extra program build steps or copies
                      of objects
Low-overhead          Minimize extra program run time
Portable              Support informative profiles on any OS and CPU
Robust                Do not rely on program correctness, in
                      particular clean exits
Tolerant              Report quality should gracefully degrade with
                      worse system support, poorer profile data, and
                      less rich debugging data in object files
Exhaustive            Track as much relevant CPU activity as possible
Multilingual          Support different programming languages,
                      multi-language environments

Table 1: Desirable profiling system features.


The functionality of PCT gracefully degrades with the
available support in the system and the executables of
interest. Debugging data or symbol tables are needed for
highly meaningful reports. Nevertheless, even stripped
binaries allow some analysis. For example, one can track
the usage of dynamic library functions or emit annotated
disassembly. In concert with instrumentation-based basic
block-level profiling such as gcov, PCT can even estimate
CPU cycles per instruction. Sampled data can be windowed
in time to isolate different CPU intensive periods of 
a program’s execution. The various report formats 
are available through a set of composable primitive 
programs and shell pipelines. 


Profile reports may be generated at any time, even 
prior to program termination and several times over 
the life of one process. Several granularities are 
available for data aggregation and report formats. 
Depending on the debugging data available in ex- 
ecutables, users can select how to display program 
locations. This may be at the level of individual
instructions, line numbers, functions, source files, or
even whole object files or libraries. 


The organization of this paper is as follows. Sec- 
tion 2 discusses our design objectives. To make 
PCT’s capabilities more concrete, Section 3 shows a 
few examples. Section 4 then elaborates upon PCT’s 
implementation of data collection. Section 5 details
report generation strategies. Section 6 evaluates the
overhead and accuracy of the toolkit. Section 7 dis- 
cusses some other approaches to profiling. Section 8 
describes how to obtain the software. Finally, Sec- 
tion 9 concludes. 


2 Design Objectives 


The design goals of PCT were driven by user needs and the
inadequacies or inaccessibility of prior systems. Table 1
highlights these objectives. The following section argues
for the importance of each in turn, and the approach of
PCT in general.


Programmers use profiling systems to understand 
what causes performance characteristics. E.g., if cer- 
tain functions dominate an execution, then a profile 
should tell us why those calls are made, and why 
they might be slow. If functions are called with ar- 
guments implying quite different “job sizes”, then 
a profile should be able to capture this for analy- 
sis. Exactly how causally informative profiling can or 
should be is an open issue. More information is bet- 
ter up until some point where overhead and analysis 
tractability concerns become a problem. Programmers
currently have far more a priori knowledge
about what to look for than any automatic system 
can hope to have. A practical answer is an exten- 
sible collection system that lets users decide what 
program variables are most relevant to subsequent 
performance analysis. 


Performance problems often arise only on inputs by 
end users unanticipated by the programmer or in 
very late stages of testing. These issues are thus 
discovered at the worst possible time for rebuilding
a program and all its dependencies. Long-running
programs such as system services often have
phased behavior. That is, sections of the program 
with quite distinct performance characteristics ex- 
ecute over various windows in time. Profiles over 
entire program executions can introduce unwanted 
averaging over this phased behavior, making results 
more difficult to interpret. A direct and flexible way 
to address this problem is to allow late-binding. Ide- 
ally, activating and deactivating profile collection 
should be possible at any stage in the life cycle of 
a program. As an immediate correspondent, early- 
reporting is also desirable so that long running pro- 
grams with highly active phases do not need to ter- 
minate before a profile can be examined. Together, 
these let programmers apply whatever knowledge 
they have about phased behavior. 


Classic instrumentation techniques raise a num- 
ber of administrative, theoretical, and practical is- 
sues. Instrumentation usually requires extra steps to 
build two versions of executables and libraries, or- 
dinary and instrumented. It is often problematic to 
require recompilation of all objects in all libraries 
or to require commercial vendors to provide multiple
versions of their libraries. Providing multiple library
versions can be a burden even on the various contemporary
open source platforms. For instance, profiling-
instrumented libraries in /usr/lib on open source
distributions are scarce or entirely absent. Also, it is
possible to instrument code long after linking it. For
example, binary rewriting techniques along the lines of
Pixie [23] or Quantify [14] allow this. Completely
dynamic instrumentation is also possible [18].


Nonetheless, instrumentation, at whatever time, 
raises several issues. Instrumented code really is not 
the same as the original code. Subtle microarchi- 
tectural effects can make it hard to understand the 
overhead of new instructions. Beyond theoretical ac- 
curacy issues, there is also a more practical concern 
in that getting the instrumentation correct is a chal- 
lenging problem in itself. Profiling instrumentation 
can interact badly with “new” compiler features, op- 
timization strategies, or uncommon language usage 
patterns. In the worst case, which is all too frequent,
the produced executable may not even run correctly.
Finally, with the possible exception of fully dynamic
instrumentation, this strategy is inherently less
extensible. Only a priori data types can be extracted,
and this is usually limited to simple counts of
executions to avoid re-implementing a good deal of
compiler technology.


While instrumentation has the virtue of precision, 
the above considerations suggest that we should go 
as far as possible with systems that are non-invasive 
to the stream of instructions the CPU encounters. 
In essence, this implies a sampling-based approach.
Sampling also has the virtue of incurring tunably 
low overhead. 


Performance problems often arise only when programs
are used in very different environments from
where they were developed. Platform-specific profil- 
ing packages can be more efficient and occasionally 
more capable. However, they do not help if perfor- 
mance problems cannot be reproduced on supported 
platforms or environments. Programmers also have 
a rational resistance to learning and relying upon
multiple, disparate system-specific tools and inter- 
faces. Therefore, a more portable system is more 
valuable. 


Portability concerns also suggest a sampling ap- 
proach. Any preemptive multi-tasking OS already 
suspends and resumes programs as a matter of 
course. The only missing pieces for profile collection
are a means to suspend frequently and a mecha- 
nism to inspect the state of a program. Reading a 
program’s state is inherently simpler than re-writing 
its code. Thus, sampling is typically no more intru- 
sive than ordinary preemption and requires simpler, 
less specialized system support than automatic in- 
strumentation. 
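The two missing pieces, a frequent suspension and a way to inspect program state, can be sketched with a CPU-time interval timer. The example below is an analogy written for a Python process on a Unix-like system, not PCT's implementation: `sample_cpu_profile` and `spin` are hypothetical names, and the sampled "state" is simply the interrupted Python frame. It relies on `signal.setitimer` with `ITIMER_VIRTUAL`, which delivers SIGVTALRM as the process consumes user CPU time, and it must run in the main thread.

```python
import collections
import signal
import time


def sample_cpu_profile(work, interval=0.005):
    """Run work() and count where SIGVTALRM ticks interrupt it."""
    counts = collections.Counter()

    def on_tick(signum, frame):
        # `frame` is the Python frame that was executing when the
        # virtual (CPU-time) timer expired.
        if frame is not None:
            counts[(frame.f_code.co_name, frame.f_lineno)] += 1

    previous = signal.signal(signal.SIGVTALRM, on_tick)
    signal.setitimer(signal.ITIMER_VIRTUAL, interval, interval)
    try:
        work()
    finally:
        signal.setitimer(signal.ITIMER_VIRTUAL, 0, 0)  # stop sampling
        if previous is not None:
            signal.signal(signal.SIGVTALRM, previous)
    return counts


def spin(cpu_seconds=0.3):
    """Hypothetical workload: burn CPU time in a busy loop."""
    start = time.process_time()
    while time.process_time() - start < cpu_seconds:
        pass


if __name__ == "__main__":
    for (name, line), n in sample_cpu_profile(spin).most_common(5):
        print(f"{n:4d} samples at {name}:{line}")
```

Changing `interval` is the tunable part: a shorter interval yields more samples and more overhead, a longer one the reverse.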


Some past profiling systems have used in-core 
buffers that are written to disk in atexit() handlers 
at the end of a clean program shutdown. A system 
should not mandate clean termination in order to 
diagnose performance problems. One phase of a pro- 
gram may warrant performance investigation even 
if other phases are buggy. Inputs needed to trigger 
performance pathologies may also instigate incor- 
rect behavior. Conversely, performance pathologies 
can easily trigger failure modes not ordinarily en- 
countered. Thus, profile collection should ideally be 
robust against program failure. Many existing imple- 
mentations could be adapted to be more cautious in 
this regard. 


Bottleneck code can potentially hide anywhere in a 
program. Restricting profiling coverage to only code 
compiled or linked into the address space in certain 
ways leads to many “holes” in the accounting of 
where execution time was spent. The more exhaustive
code coverage is, the more likely a profile will
unravel performance mysteries.


Mixed programming language systems have become 
pervasive in modern software development environ- 
ments. While C and C++ are a canonical exam- 
ple, it can be the case that a wider range of lan- 
guages, e.g. FORTRAN, Ada are supported within 
one program. If unified source-level debugging ex- 
ists for these multilingual environments then source- 
level profiling should also be supported. A profiling 
system wedded to a particular programming lan- 
guage or code generation system is too inflexible. 


PCT is the first profiling system known to the au- 
thors to possess all of these properties simultane- 
ously. Extensible and informative data collection is 
achieved through the ability of source level debug- 
gers to compute arbitrary expressions and do de- 
tailed investigation of program state. PCT is non- 
invasive and low-overhead since sampling does lit- 
tle more than what the OS ordinarily does during 
task switching. The sampling rate can be changed 
to trade-off overhead with accuracy. Late-binding 
is achieved by delaying activation of sampling code 
or having it instigated by an entirely different pro- 
cess, namely the debugger. Portability derives from 
relying only on old, well-propagated system facili- 
ties dating back to the mid-1980s. Robustness en- 
sues from the earliest possible commitment of data 
to the OS buffer cache, which is closely related to
producing reports as soon as any data has been collected.
The system is as multi-lingual as the executable linking
environment allows. PCT is as exhaustive as the debugger,
OS, and build environment allow. The very small report
generation modules enable tolerating various levels of
debugging data, customizing reports, and porting PCT to a
new platform.


PCT is also a small system. The code for PCT is 
only 3,500 commented lines of C code and 300 lines 
of shell scripts. This compact delivery of function- 
ality is possible only because PCT greatly leverages 
common system facilities. 


3 Examples 


On many systems getting a quick profile is as simple
as: profile myprogram args...


More concretely:

$ profile ./fingerprint /bin | head
13.9% /u/cblake/hashfp/binPoly64.C:101
.6% /u/cblake/hashfp/binPoly64.C:86
3% /lib/libc-2.1.2.so:getc
.3f, /u/cblake/hashfp/fingerprint.C:112
3, /u/cblake/hashfp/binPoly64.C:80
2/4 /u/cblake/hashfp/binPoly64.C:70
0, /u/cblake/hashfp/binPoly64.C:96
.7% /u/cblake/hashfp/binPoly64.C:102
O% /u/cblake/hashfp/fingerprint.C:116
.8% /u/cblake/hashfp/fingerprint.C:104


By default line numbers are used for source coordi- 
nates. If only symbols are available they are used. 
Finally, raw objectfile:address pairs are printed when 
there is no debugging data at all. 
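This fallback order can be written down as a small sketch. The function and its parameters are hypothetical, and printing the address in hexadecimal is an assumption. Only the order of preference (line number, then symbol, then raw address) comes from the text.

```python
def format_location(obj, addr, symbol=None, src_file=None, line=None):
    """Render a sampled location with the best data available."""
    if src_file is not None and line is not None:
        return f"{src_file}:{line}"   # full debugging data
    if symbol is not None:
        return f"{obj}:{symbol}"      # symbol table only
    return f"{obj}:{addr:#x}"         # stripped binary


print(format_location("fingerprint", 0x8048F10,
                      src_file="/u/cblake/hashfp/binPoly64.C", line=101))
print(format_location("/lib/libc-2.1.2.so", 0x4A3B0, symbol="getc"))
print(format_location("/lib/libc-2.1.2.so", 0x4A3B0))
```

The addresses in the usage lines are invented. The first two calls reproduce the shape of the entries in the listing above.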


Profiling mixed kernel and user code on Linux is 
similar. Below is a quick profile of the disk usage 
utility which recurses down a directory tree summing
up file allocations:¹


$ profile -k du -s /disk/paO | head
30.4% /usr/src/linux/vmlinux:iget4
12.4% /usr/src/linux/vmlinux:ext2_find_entry
5.4% /usr/src/linux/vmlinux:try_to_free_inodes
3.5% /usr/src/linux/vmlinux:ext2_read_inode
3.3% /usr/src/linux/vmlinux:unplug_device
2.4% /usr/src/linux/vmlinux:lookup_dentry
2.0% /usr/src/linux/vmlinux:system_call
1.8% /usr/src/linux/vmlinux:getblk
1.0% /lib/libc-2.1.2.so:open
0.9% /lib/libc-2.1.2.so:__lxstat64


Note the call hierarchy in the following program: 


/* worker() burns CPU time; both dispatchers call it. */
void worker(unsigned n)     { while (n--) /**/ ; }
void dispatch_1(unsigned a) { worker(a); }
void dispatch_2(unsigned b) { worker(b); }


int main(int ac, char **av) {
    dispatch_1(10000000);
    dispatch_2(20000000);
    return 0;
}


Before doing any profiling it is obvious that essentially
all run time is in the function worker(). There are two
paths to this function, as shown clearly via the
debugger-based hierarchical profile:


$ profile -gdb -13 hier-test
67.6% worker <- dispatch_2 <- main
32.4% worker <- dispatch_1 <- main

¹ Directory and i-node data was pre-read to make the
results reflect CPU time spent in the 2.2 kernel.


Finally, consider sampling more semantically rich data.
In general this requires amending a 10 line dbctl script
similar to the following:


 1  EXECQ “ox {
 2    #include "gdbprof_prologue.dbctl"
 3    PAT_GROUP(default) {
 4      PAT(1) "signal=\"SIGVTALRM\"" | OUT("pc") {
 5        "backtrace 4" | OUT("stack") ;
 6        # OTHER DEBUGGER EXPRESSIONS
 7        "continue";
 8      }
 9      #include "gdbprof_epilogue.dbctl"
10    }
11  }


A full description of the pattern-driven state ma- 
chine language is beyond the scope of this paper. 
The main EXEC pattern on line 1 restricts which 
executables the entire rule applies to. Lines 4, 5,
and 6 simply capture the pc in the output file "pc",
and four levels of stack backtrace in the output file
"stack". Adding more debugger commands and data
files is just a matter of adding a line to the dbctl 
script. Depending on the expressions sampled, var- 
ious post-processing steps may be needed. 


We provide a simpler interface for the common case
of sampling scalar numbers. The following shows how to
sample values of n inside the function worker(),
where it is meaningful.


$ profile -gdb \
    -expr 'hier-test@worker@n' 'int-avg' \
    hier-test
8.38e+06


The -expr option takes two arguments: a context-specific
expression to generate data in the debugger,
and a program to format the collected data. The
context-specific expression is an '@'-separated tuple
of strings: a program pattern, a function pattern, and a
debugger expression.
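
The tuple format can be illustrated with a small parser. The following is a minimal sketch in C, not PCT's actual code: the type struct expr_spec and the function parse_expr_spec() are hypothetical names introduced here, and splitting only at the first two '@' characters is an assumption.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical container for the three fields of an -expr argument. */
struct expr_spec {
    char program[64];    /* program pattern, e.g. "hier-test" */
    char function[64];   /* function pattern, e.g. "worker"   */
    char expr[128];      /* debugger expression, e.g. "n"     */
};

/* Split "program@function@expr" at the first two '@' characters.
 * Returns 0 on success, -1 if a field is missing or does not fit. */
int parse_expr_spec(const char *arg, struct expr_spec *out)
{
    assert(arg != NULL && out != NULL);

    const char *a = strchr(arg, '@');
    const char *b = a ? strchr(a + 1, '@') : NULL;
    if (a == NULL || b == NULL)
        return -1;

    size_t plen = (size_t)(a - arg);
    size_t flen = (size_t)(b - a - 1);
    size_t elen = strlen(b + 1);
    if (plen >= sizeof out->program || flen >= sizeof out->function ||
        elen >= sizeof out->expr)
        return -1;

    memcpy(out->program, arg, plen);
    out->program[plen] = '\0';
    memcpy(out->function, a + 1, flen);
    out->function[flen] = '\0';
    memcpy(out->expr, b + 1, elen + 1);   /* includes the terminator */
    return 0;
}
```

Splitting at only the first two separators leaves any later '@' inside the debugger expression, which matters for gdb expressions such as *buf@4 that use '@' themselves.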


4 Data Collection 


The general implementation philosophy of PCT is to 
support a full set of options for every aspect of pro- 
filing. This minimizes the chance that some system 
limitation will prevent any profiling outright and 
enables “best effort” profiling. Small, composable
primitives also ease tailoring PCT behavior. Basic
users generally use several generic driver scripts,
while more advanced users create their own tailored 
script wrappers. 


4.1 Activating Sampling Code 


PCT has three basic collection strategies: debug- 
gers, timer signal handlers, and profil().[5] The 
first never requires re-starting or re-linking a pro- 
gram, but can have substantial real-time over- 
head. The latter two are fall-back, library-based 
strategies which can be used when low overhead 
is preferable or when debugger support is inadequate.
The library-based samplers are, however, less
portable. They require linker support for C++-style
global initializers and also require either a dynamic
library pre-loading facility or manual re-linking.
Dynamic pre-loading is commonly available with modern
dynamic linkers [3], though not all programs are
dynamically linked. In the worst case, if C++-style
linking is unavailable, programmers can manually
invoke the initializer inside their main() routine. We
now examine these samplers in more detail. 


The oldest portable profiling primitive is profil(). 
This system call directs kernel-resident code to
accumulate a histogram of program counter locations
in a user-provided buffer. While the call interface 
does potentially allow multiple executable regions, 
the authors know of no operating systems that can 
activate more than one region at a time. To ensure 
robustness, PCT allocates the userspace buffer as 
an mmap()-ed file. This also allows the profile to 
be accessible at any time to other processes, such 
as report generators. The profil() call to acti- 
vate kernel-driven collection can be done with ei- 
ther a dynamically pre-loaded or statically linked-in 
library. 


A source-level debugger affords a more general 
sampling activation strategy. The debugger uses 
the ptrace() facility and catches all signals deliv- 
ered to the process, including virtual timer alarms. 
ptrace() can be used to attach and detach from 
processes at any time and any number of times over 
the lifetime of a process. PCT uses a debugger- 
controller program to implement this procedure 
portably. 


The PCT debugger controller drives the debugger 
which in turn controls the process via ptrace(). 
The debugger calls the POSIX setitimer() system
call in the context of the target process. This
installs virtual time interval timers for the target
process. Once these timers are installed, the kernel
will deliver VTALRM signals periodically to the tar- 
get process. At each signal delivery, control will be 
transferred to the debugger. At this point the debug- 
ger driver issues whatever debugger commands are 
necessary to collect informative data and then
continue program execution. For example, a backtrace
or where command typically produces a sample of 
function call stack data. The real time of the sam- 
ple can also be recorded. Section 4.3 discusses the 
details of debugger control. 


When library code can be used, a pre-main() ini- 
tializer sets up interval timers, signal handling, and 
data files. gcc-specific, C++, or system-dependent 
library section techniques can be used to install the 
library initializer. Library code may be statically or 
dynamically linked, or preloaded for dynamically
linked executables via the $LD_PRELOAD environment
variable.
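
The pre-main() initializer described above can be sketched as follows. This is only an illustration, assuming gcc's constructor attribute and a handler that merely counts samples rather than recording program counters; the names sampler_start(), sampler_count() and sampler_init() are hypothetical and are not taken from PCT.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/time.h>

static volatile sig_atomic_t sample_count;   /* samples seen so far */

/* SIGVTALRM handler: a real collector would record the interrupted PC. */
static void on_vtalrm(int signo)
{
    (void)signo;
    sample_count++;
}

/* Install the handler and a repeating virtual-time interval timer.
 * Returns 0 on success, -1 on failure. */
int sampler_start(long period_usec)
{
    struct sigaction sa;
    struct itimerval it;

    assert(period_usec > 0);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_vtalrm;
    sa.sa_flags = SA_RESTART;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGVTALRM, &sa, NULL) != 0)
        return -1;

    it.it_interval.tv_sec = period_usec / 1000000;
    it.it_interval.tv_usec = period_usec % 1000000;
    it.it_value = it.it_interval;
    return setitimer(ITIMER_VIRTUAL, &it, NULL);
}

long sampler_count(void)
{
    return sample_count;
}

/* gcc-specific: runs before main() without changes to program source. */
__attribute__((constructor))
static void sampler_init(void)
{
    sampler_start(10000);   /* one sample per 10 ms of process CPU time */
}
```

Built as a shared object, a library of this shape could be activated through $LD_PRELOAD without re-linking the target program.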


PCT collection behavior is controlled through the 
$PCT environment variable. It controls options such 
as output directories, histogram granularity, data 
format, and so on. It also provides a convenient 
switch for whether profiling happens at all. $PCT and 
$LD_PRELOAD can both be inherited across fork()
and exec(). This conveniently enables profile col- 
lection activation on whole process subtrees. 


4.2 Collecting Data 
4.2.1 Types of Code 


The debugger collector supports tracing whatever 
code the debugger can recognize. All debuggers han- 
dle the main program text. Most modern debuggers, 
e.g. gdb, can debug code in shared libraries and late- 
loaded object files on most operating systems. 


PCT library-based collectors have more specific re- 
strictions. They can collect data on several kinds of 
code: 


• One contiguous region: usually the main program
text. This is the oldest style of profiling
and works in almost any OS and scenario.


• Shared libraries: On Linux the instantaneous
bindings of virtual memory regions to files are
exposed. Reading /proc/PID/maps reveals
executable regions and corresponding object files.
On most BSD OSes ldd reports load addresses
of shared libraries. These addresses may be
cached in files similar to /proc/PID/maps and
read in by the global initializer.


• Late-loaded (e.g. dlopen()ed) code: On Linux,
whenever a PC cannot be mapped to a known
memory region, the signal handler re-scans
/proc/PID/maps to attempt to discover new
regions. If it succeeds, the region table is
updated and the counts processed. When an object
file for the PC cannot be found, further
re-scanning is inhibited to suppress repetitive
failed searches.


• Kernel code: Linux provides a /proc/profile
buffer for the main text of the kernel. Currently,
Linux does not support profiling loadable kernel
modules.
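

The region discovery used for shared libraries and late-loaded code can be sketched in C. This is an illustrative reading of the Linux /proc/PID/maps line format, not PCT's implementation; struct region, parse_maps_line() and find_region() are hypothetical names, and path names containing spaces are not handled.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct region {
    unsigned long start, end;   /* [start, end) in process address space */
    unsigned long offset;       /* offset of the mapping within the file */
    char path[256];             /* backing object file                   */
};

/* Parse one /proc/PID/maps line. Returns 1 and fills *r if the line
 * describes an executable, file-backed mapping; returns 0 otherwise. */
int parse_maps_line(const char *line, struct region *r)
{
    char perms[5];
    int n;

    assert(line != NULL && r != NULL);
    r->path[0] = '\0';
    /* Fields: start-end perms offset dev inode path; dev and inode are skipped. */
    n = sscanf(line, "%lx-%lx %4s %lx %*s %*s %255s",
               &r->start, &r->end, perms, &r->offset, r->path);
    if (n < 5)
        return 0;               /* anonymous mapping: no path field */
    return perms[2] == 'x' && r->path[0] == '/';
}

/* Index of the region containing pc, or -1. In the scheme described
 * above, a miss is what would trigger a re-scan of the maps file. */
int find_region(const struct region *tab, int n, unsigned long pc)
{
    for (int i = 0; i < n; i++)
        if (pc >= tab[i].start && pc < tab[i].end)
            return i;
    return -1;
}
```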


As mentioned in Section 4, profil()-based collection
is generally only available for a single contiguous
region of address space. These other types of
code are all supported by the more general library- 
based collector. Profiling kernel modules could be 
added to Linux or other OS’s using the same tech- 
niques that PCT uses for managing shared libraries 
and late-loaded code. 


4.2.2 Types of Profiling Data 


Collection methods based on profil() or 
/proc/profile afford little choice as to the 
type or format of data collected. Other sampling
methods, such as $LD_PRELOAD and debuggers,
allow collecting a variety of data. This flexibility
creates choices as to what data is collected for later
analysis, and how and where it is stored.


PCT provides several data storage formats. The specific
profiling situation will usually determine which
is best. The choices are:


• Debugger output files: a different log file is used
to save the output of each user-specified sampling
expression.


• Histogram file: stores frequency counts for various
code regions. This guarantees bounded
space, but cannot window data.


• Sample-ordered log file: allows simple time-windowing
of collection events and post facto histograms,
but can grow in size indefinitely.


• Circular log file: similar to the sample-ordered
log file except that the user bounds the size,
which effectively saves only the last N samples.
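

The circular log can be pictured as a fixed-size ring of samples. The sketch below is an in-memory illustration only; the on-disk layout PCT uses is not described here, and struct ring, ring_put(), ring_snapshot() and the bound RING_CAP are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define RING_CAP 4   /* illustrative bound N on retained samples */

struct ring {
    unsigned long pc[RING_CAP];
    size_t next;     /* slot the next sample will overwrite */
    size_t total;    /* samples ever recorded               */
};

/* Record one sample, overwriting the oldest once the ring is full. */
void ring_put(struct ring *r, unsigned long pc)
{
    assert(r != NULL);
    r->pc[r->next] = pc;
    r->next = (r->next + 1) % RING_CAP;
    r->total++;
}

/* Copy the retained samples into out, oldest first.
 * Returns how many samples were copied (at most RING_CAP). */
size_t ring_snapshot(const struct ring *r, unsigned long *out)
{
    size_t kept = r->total < RING_CAP ? r->total : RING_CAP;
    size_t first = (r->next + RING_CAP - kept) % RING_CAP;

    for (size_t i = 0; i < kept; i++)
        out[i] = r->pc[(first + i) % RING_CAP];
    return kept;
}
```

Because the snapshot stays in sample order, the same time-windowing and post facto histogramming used for the sample-ordered log still apply to the retained tail.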


4.3 Controlling Debuggers 


The PCT toolkit includes a controller program 
dbctl for driving debugger tools such as gdb. The 
controller program is a state machine described by 
user-specified files. A transition in the state diagram
occurs when the controller recognizes a regular ex- 
pression in the output of the debugger. For each 
transition, there is a set of debugger commands to 
issue as well as a series of controller actions. The de- 
bugger commands can be any command appropriate 
for the debugger tool being controlled. For example, 
controller actions might include logging debugger 
output, spawning new debuggers to attach to child 
processes, and capturing specified debugger outputs 
in internal controller variables. 
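

A pattern-driven controller of this kind can be sketched with POSIX regular expressions. The table contents, the state numbers and the step() function below are invented for illustration and do not reproduce dbctl's syntax or internals; the sketch only shows a transition firing when a pattern matches a line of debugger output.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

struct transition {
    int from;              /* state in which the pattern is active     */
    const char *pattern;   /* extended regex matched against output    */
    const char *command;   /* debugger command to issue on a match     */
    int to;                /* state entered after the match            */
};

/* Hypothetical two-state table: sample on SIGVTALRM, then resume. */
static const struct transition table[] = {
    { 0, "SIGVTALRM", "backtrace 4", 1 },
    { 1, "^#[0-9]+ ", "continue",    0 },
};

/* Feed one line of debugger output to the machine. Returns the command
 * to send to the debugger, or NULL if no active pattern matched. */
const char *step(int *state, const char *line)
{
    assert(state != NULL && line != NULL);
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        regex_t re;
        int hit;

        if (table[i].from != *state)
            continue;
        if (regcomp(&re, table[i].pattern, REG_EXTENDED | REG_NOSUB) != 0)
            continue;
        hit = regexec(&re, line, 0, NULL, 0) == 0;
        regfree(&re);
        if (hit) {
            *state = table[i].to;
            return table[i].command;
        }
    }
    return NULL;
}
```

A production controller would compile each pattern once rather than per line; recompiling here keeps the sketch short.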


In the case of profiling, the dbctl tool sets up the 
interval timers in the processes to be profiled. When
the timers expire, a signal is raised which transfers
control to the debugger and generates output indi- 
cating the context of suspension. The controller rec- 
ognizes various process contexts and issues context- 
specific instructions. These can include writing out 
the call stack, local variables and function argu- 
ments, or any arbitrary debugger expression. 


Profiling is not the only application of dbctl. The 
tool can also be used to implement a portable 
strace [7] or ltrace [11] facility. While many de- 
buggers have function call tracing capabilities, trac- 
ing an entire process tree is more challenging. 


The UNIX ptrace() mechanism has traditionally 
had rather weak support for following both a parent 
and child process across a fork(). Typically, there is
a constraint of one-to-one binding between a traced 
process and a tracing process. After a fork() only 
one of the potentially traced processes can remain 
under external control. The other is released to be 
scheduled by the OS. Breakpoints left in an untraced
process cause a SIGTRAP that causes the process 
to die since it has no debugger to catch the signal 
on its behalf. Therefore a debugger arranges things 
so that breakpoints are disabled across a fork() for
one of the two processes.


For dbctl this means that if we arrange to follow 
the parent, then a fork()ed child could “run away” 
from the controller, possibly fork()-ing grandchil- 
dren before dbctl can attach a new debugger to 
it. Similarly, if we arrange to follow the child, the 
parent could run away, forking other children before
a new debugger can be attached. Ideally, kernels
would provide a standard interface to “fork and
suspend” ptrace()d processes. Lacking a natural
interface to avoid this race condition, PCT developed
an interesting workaround. Our fork()-following
protocol guarantees that no child process is ever lost
and that breakpoints can be re-enabled immediately
after the call to fork(). 


The protocol works as follows. First, we set a break- 
point at all fork() calls to catch the spawning of 
children. When the fork() breakpoint is hit, the 
debugger disables all breakpoints so that the un- 
traced process will not get spurious SIGTRAP sig- 
nals. It then installs pause() as a signal handler 
for SIGTRAP. Finally, it sets a breakpoint for the in- 
struction following the call to fork(). 


The parent remains ptrace()d all along, and immediately
traps to the debugger because of the fork()-return
breakpoint. All normal breakpoints are re-enabled.
The child process id can be found on the
stack as the return value from the fork(). dbctl 
uses this pid to attach a new instance of the debug- 
ger to the child process. 


Concurrently, the fork()-return breakpoint causes 
a SIGTRAP to be delivered to the child. Since it no 
longer has a tracing process, the process’ own sig- 
nal handler is invoked. In this case that function is 
pause, a system call which simply waits until some 
signal is delivered. When a newly spawned debugger 
successfully attaches to the process it interrupts the 
ongoing pause. The procedure of attaching a new 
debugger is now complete. dbctl then re-establishes 
any necessary breakpoints and so on in both pro- 
cesses and lets them run again. 


4.4 Limitations 


Library-based histogram collectors face a problem 
with fork()d processes/threads which truly run in 
parallel (i.e. on multiple CPU systems). The paral- 
lel processes can potentially overwrite each other’s 
counter updates. The result is a missed counter
increment with the last writer winning the counter
bump. This is rare and is probably not an issue in 
practice. In any event, one can assess the number of 
lost counts by the total scheduling time given and 
the total counts collected. If there is a major dis- 
crepancy one can switch to log-file profile collection 
which does not share this problem. 
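

The consistency check suggested above amounts to simple arithmetic. The following sketch assumes that the scheduled CPU time and the sampling period are both known in microseconds; expected_samples() and lost_fraction() are hypothetical helpers, not PCT programs.

```c
#include <assert.h>

/* Samples one would expect after cpu_usec of CPU time at one sample
 * per period_usec. */
long expected_samples(long cpu_usec, long period_usec)
{
    assert(period_usec > 0);
    return cpu_usec / period_usec;
}

/* Fraction of the expected samples that never reached the histogram;
 * clamped at zero when nothing appears to be missing. */
double lost_fraction(long cpu_usec, long period_usec, long collected)
{
    long expected = expected_samples(cpu_usec, period_usec);

    if (expected <= 0 || collected >= expected)
        return 0.0;
    return (double)(expected - collected) / (double)expected;
}
```

For instance, 3 seconds of CPU time at a 10 ms period should yield about 300 samples, so a histogram totalling 297 counts would indicate roughly 1% lost increments.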


Hierarchical samples are currently only supported 
with the debugger collector. The code to walk back 
a stack frame is conceivably simple enough for some 
CPUs to embed directly into our signal handler li- 
brary. This could drastically reduce overhead at the 
cost of sacrificing some CPU portability and prob- 
ably some language neutrality. 


Our debugger controller requires that executables 
either be dynamically linked to the C library, or if 
statically linked contain a few critical symbols, such 
as pause, that may not be strictly required to be 
present. Of course the debugger can also do very 
little with executables stripped of all symbol data. 


Sampling rates are limited by maximum VTALRM delivery
rates. The shortest delivery period typically ranges
from 1 to 10 ms. Some systems allow increasing this
rate. Depending on the richness of collected data, it may not be
desirable to increase this rate, as that would entail 
more real-time overhead. 


5 Data Analysis 


PCT data collection strategies produce files with 
quite different information. Designing one mono- 
lithic way of reducing this data to programmer in- 
terpretable relationships is hard. Instead PCT pro- 
vides a toolkit of data aggregation and transforma- 
tion programs. These can be easily composed via 
UNIX shell pipelines. Their usage is simple enough 
that users can tailor simple scripts toward individ- 
ual circumstances and preferences. 


5.1 Source Coordinate Resolution 


Debuggers emit high-level source coordinates as a 
matter of course. They are constrained by how much 
debugging data has been compiled into the executables
and libraries being used. If these object files
have been compiled with the full complement of
debugging data then source and line number-level,
function-level, and address-level coordinates are all
available. Usually, all are present in the textual
debugger output PCT records. This mode of data
collection then yields a lot of choices. PCT lets users
select the coordinates to be used in reports. 


On the other hand, library-based collectors do not 
resolve program counter addresses to source co- 
ordinates while the program is running. Instead 
they record only program counter addresses and 
defer higher-level coordinate translation to report 
generation-time. These addresses are saved in com- 
pact binary data formats that keep logs small and 
minimize IO overhead. There is usually one binary 
data file for each independently mapped region of 
program address space. Embedded within these files
are the path names and in-memory offsets of the 
memory-mapped object files. This provides the key 
information for deferred translation to understand 
how addresses in memory correspond to addresses 
in the object files. 


Translation of program addresses to more meaning- 
ful source coordinates can be awkward. Object file 
formats vary substantially. The GNU binary file de- 
scriptor library gives some relief, allowing the writ- 
ing of programs which directly access debugging 
data and symbol tables. This library may be un- 
available, out of date, or not support the necessary 
object file formats. As the examples in Section 3 
show, PCT makes a best-effort attempt to translate
coordinates.


At the least, if executable files retain their symbol 
table, the system nm program or the debugger can 
interpret it. If there is no debugger, the GNU binu- 
tils package provides a convenient addr2line pro- 
gram which can map PCs to file:line source coordi- 
nates. If there is a debugger installed, then a debug- 
ger script can operate just as addr2line in the re- 
stricted capacity of address translation. If there is no 
debugging data at all, as for stripped binaries, then 
a disassembly procedure is always an option. Indeed, 
for instruction-level optimization, it may even be de- 
sired to produce count-annotated disassembly files 
as reports. Of course, assembly-level expertise is re- 
quired to interpret such reports. 


PCT provides a printing program pct-pr which 
bridges the gap between PCT binary data files and 
programs which perform address translation. pct-pr
assumes a simple and convenient protocol for shell 
pipeline syntax. Translators are run as co-processes 
to pct-pr. That is, translator programs read a series of PCs on
their standard input and emit corresponding source
coordinates to their standard output. 


This co-process setup allows fine-tuning of the protocol
with pct-pr command-line arguments. For example,
printf()-style format strings allow tailoring
the object file PC stream to input requirements,
and also allow customizing the output stream. Users 
can stamp PCT binary data files with particular PC 
translation requirements, or simply describe which 
translators to use on the pct-pr command line. 


PCT also provides a program addr2nm to translate 
PCs using only the system nm and symbol tables 
in executables. This program first dumps nm output 
into a cache directory if it is not already there. Once 
this file exists, addr2nm does an approximate binary
search on each inbound PC, discovering the symbol
with the greatest address not exceeding that PC.
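

The lookup step can be sketched as a lower-bound style binary search over a symbol table sorted by address. This is an assumed in-memory form of the idea, not addr2nm itself; struct sym and sym_lookup() are hypothetical, and the table is presumed to come from address-sorted nm output.

```c
#include <assert.h>
#include <stddef.h>

struct sym {
    unsigned long addr;   /* symbol start address */
    const char *name;
};

/* syms must be sorted by ascending addr. Returns the last symbol whose
 * address is <= pc, or NULL if pc lies below every symbol. */
const struct sym *sym_lookup(const struct sym *syms, size_t n,
                             unsigned long pc)
{
    size_t lo = 0, hi = n;

    assert(syms != NULL || n == 0);
    /* Invariant: entries before lo have addr <= pc,
     * entries at or after hi have addr > pc. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (syms[mid].addr <= pc)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo == 0 ? NULL : &syms[lo - 1];
}
```

A PC that falls past the end of the last function is still attributed to that last symbol, which is one reason such a translation is only approximate.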


5.2 Data Aggregation 


One often wants to aggregate profile data over var- 
ious uses of the same programs or libraries. The 
canonical example is combining many runs of short- 
lived programs. In the context of profiling a pro- 
cess tree, one may want to examine many distinct 
processes as one aggregate set of counts. For example,
a libc developer might be interested in
all the usages of some particular function, e.g.
printf(), throughout a process tree. PCT supports
these various styles of aggregation via simple filename
conventions and traditional UNIX filename
patterns. A user can select collections of data files
by common filename substrings such as the name
of the object files of interest. For example, pct-pr
/tmp/pct/ct/myprogram.*/libc* would generate
source coordinates for all samples of the libc code
used by myprogram.


The output of source coordinate translation for 
each file in a collection is a simple pair of sam- 
ple counts and labels for that location in the pro- 
gram. The granularity of these labels, e.g. function 
or source:linenumber, will determine the notion of 
similarity for later tabulation. This stream can be 
sorted with both PCT-specific and standard UNIX 
filters to produce a stream where text lines refer- 
ring to “similar” code locations are adjacent. A fil- 
ter can then aggregate counts over text lines with 
these “similar” suffixes, and hence the similarity
determines the level of aggregation. These aggregates
are effectively histograms of program counter samples
over the address space of the programs. The
histogram bins are determined by the granularity of
labels. E.g., function-granularity source coordinates
will result in a report of the time spent in various 
functions. 


Users often find it easier to think about time frac- 
tions rather than raw sample counts. PCT supports 
this with a filter that totals the whole text stream 
and then re-emits it with counts converted to per- 
centages. These percentages are normalized to what- 
ever particular selection of counts is under consid- 
eration — either over multiple runs or over multiple 
objects or other combinations.
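

The aggregation and normalization steps can be sketched on an in-memory array instead of a text stream. The following assumes the entries are already sorted so that equal labels are adjacent; struct bin, merge_adjacent() and percent_of_total() are hypothetical names, not PCT filters.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct bin {
    const char *label;      /* e.g. a function name or file:line */
    unsigned long count;    /* samples attributed to the label   */
};

/* Merge adjacent entries with equal labels in place.
 * Returns the new number of entries. */
size_t merge_adjacent(struct bin *b, size_t n)
{
    size_t out = 0;

    assert(b != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (out > 0 && strcmp(b[out - 1].label, b[i].label) == 0)
            b[out - 1].count += b[i].count;
        else
            b[out++] = b[i];
    }
    return out;
}

/* Percentage of the total count represented by entry i. */
double percent_of_total(const struct bin *b, size_t n, size_t i)
{
    unsigned long total = 0;

    assert(i < n);
    for (size_t k = 0; k < n; k++)
        total += b[k].count;
    return total ? 100.0 * (double)b[i].count / (double)total : 0.0;
}
```

The percentages are relative to whatever set of bins is passed in, matching the normalization to "whatever particular selection of counts is under consideration".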


The interface for users to these capabilities is simplified
by simple shell script wrappers. For example,
pct sym% /tmp/pct/ct/myprogram.*/libm* will 
create a function-level profile of time spent in the 
math library. 


There are a few PCT report styles that provide 
more context around the sampled program loca- 
tions. PCT can create entire copies of source-code 
files annotated with either counts or time frac- 
tions. We also have an Emacs mode much like 
grep-mode or compile-mode to drive examination 
of filename:line number profile reports. This mode 
allows a user to select report lines of high time frac- 
tion and automatically loads a buffer and warps the 
cursor to that spot in the code. 


6 Evaluation 


A few concerns arise in evaluating any profiling sys- 
tem. First, one must ask if profiling overhead is ob- 
trusive relative to real-time events. Large overhead 
could make results inaccurate. Bearing in mind that 
programs being profiled may be quite slow, exces- 
sive real-time overhead could dissuade programmers 
from using the system. A second concern is the ac- 
curacy of profiling numbers produced by the system. 
The following subsections discuss these issues. 


6.1 Overhead 


The overhead of any sampling system is tunably 
small (or large). There is a fundamental overhead- 
accuracy trade-off. The more frequently samples are 
taken, the more overhead incurred by interrupting 
the program and recording the samples. However,
the higher the sample rate, the more accurate a picture
one acquires in a given amount of time.


Large sample rates may be desirable. Some pro- 
grams run only briefly, but are CPU intensive while 
they run. Accurate probabilities may also motivate 
a fast sample rate. 


On modern CPUs, the cost of signal delivery and of
the system call that resumes execution is typically less
than 20 µsec. Thus library-based sampling procedures
have very low overhead.


We have measured gdb-based sampling as taking
500 to 1000 µsec on 700 to 1300 MHz Pentium III and
Athlon-based systems running Linux, FreeBSD, and
OpenBSD. The precise time varies depending on the
complexity of parameter lists being decoded, the
depth of the stack, the efficiency of the OS, and
the CPU. However, almost all of this overhead is
gdb making many calls to ptrace() to reconstruct
the argument lists of functions in the backtrace. It
is possible to arrange a more minimally informative
gdb sampling which does no address translation or
decoding. This resulted in under 100 µsec per sample,
including the round-trip context switch.


These numbers are still relatively encouraging. For
10 ms sampling granularities the overhead is almost
unnoticeably small unless quite rich samples are being
taken. At 1 ms sampling periods the overhead
starts to approach a factor of two, but overhead
is not prohibitive until near 100 µsec periods.
This also suggests that a debugging library could 
result in substantial overhead reduction by comput- 
ing only the necessary output. Alternately, debugger 
features could control the verbosity of output more 
finely. 
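

The relationship between per-sample cost and slowdown can be made explicit. The sketch below assumes that one sample is taken per period of program CPU time and that the cost of taking the sample is not itself charged to the virtual timer; slowdown() is a hypothetical helper.

```c
#include <assert.h>

/* Real-time slowdown factor when each sample costs cost_usec and one
 * sample is taken per period_usec of program CPU time. */
double slowdown(double cost_usec, double period_usec)
{
    assert(period_usec > 0.0 && cost_usec >= 0.0);
    return 1.0 + cost_usec / period_usec;
}
```

With the figures quoted above, 500 to 1000 µsec per sample gives a slowdown of about 1.05 to 1.10 at a 10 ms period and 1.5 to 2.0 at a 1 ms period, while a 100 µsec sample reaches a factor of two only at a 100 µsec period.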


Also note that, at least on uniprocessors, it does not 
matter how many processes are being traced. Only 
one process has the CPU at a time. So some frac- 
tional overhead applies to the real time consumed 
by the entire system of processes. 


6.2 Accuracy 


Sampling overhead does not directly impact the ac- 
curacy of time estimates. Any constant amount of 
sampling overhead has the effect of simply increasing
the sampling period. Hence only the variance of time
spent handling signals and recording samples might
impact the accuracy of program counts. In practice,
with code that has known time fractions, sampling 
time variance seems to have a negligible effect. 


Assessing precise time fractions of a program creates
a need for large sample sizes. The statistics of location
counts are approximately binomial. The program
is suspended at some location, $i$, with probability
$p_i$. The mean of a binomial random variable
for $N$ trials each of which has probability
$p_i$ is simply $N p_i$, while the standard deviation is
$\sqrt{N p_i (1 - p_i)}$. The frequency, $p_i = n_i/N$, thus has
a fractional error proportional to $1/\sqrt{N}$. Suppose,
for example, location 1 has count $n_1$ and location
2 has count $n_2$. It is easy to show that a two standard
deviation test for the condition $p_1 > p_2$ is approximately
$n_1 - n_2 > 2\sqrt{n_1 + n_2}$. E.g., to decide
$p_1 > p_2$ for $n_2 = 1$ requires $n_1 > 5$. Fortunately,
precise probabilities are usually less important than
just identifying the expensive areas of a computation.
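

The two standard deviation test can be recovered from the binomial model with a short derivation. The steps below assume that all counts come from a single multinomial draw of $N$ samples, so that the two counts are negatively correlated; this assumption is consistent with the text but is not spelled out there.

```latex
\begin{align*}
\operatorname{Var}(n_1 - n_2)
  &= \operatorname{Var}(n_1) + \operatorname{Var}(n_2)
     - 2\operatorname{Cov}(n_1, n_2) \\
  &= N p_1 (1 - p_1) + N p_2 (1 - p_2) + 2 N p_1 p_2 \\
  &= N (p_1 + p_2) - N (p_1 - p_2)^2 .
\end{align*}
% Under the null hypothesis p_1 = p_2 the last term vanishes, and
% N (p_1 + p_2) is estimated by n_1 + n_2, which gives
\[
  \sigma_{n_1 - n_2} \approx \sqrt{n_1 + n_2},
  \qquad\text{so the test is}\qquad
  n_1 - n_2 > 2\sqrt{n_1 + n_2}.
\]
```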


Reduced real-time performance is the most signifi- 
cant down-side of overhead. For very high sampling 
rates and very rich data collection a program can 
run much slower than in its native mode.


Correlations between the time of sampling and 
paths in the program are more problematic. Con- 
sider the specific example of a function which takes 
almost exactly as long to execute as the time be- 
tween timer expirations. Also assume this function 
is repeatedly invoked and dominates the execution 
time. It should be clear that the program will always
be suspended near the same location, even though
computation is distributed over all the code implementing
this function. [17] discusses this problem in the context
of CPU usage statistics. DCPI [8] addresses this
problem by randomizing the size of the time interval 
between samples. 


While profil() is inflexible in this regard, any
other PCT collection method optionally uses one-shot
timers and re-installs timers with randomly
spaced delays in the alarm handler. Using large multiples
of a typical 10 ms time quantum is likely to
seriously reduce the achieved sample size. However,
provided that successive periods are unpredictable,
even a random alternation between 10 ms and 20 ms
guards against a little accidental synchronization.
The fixed underlying timer clock makes truly preventing
synchronization effects difficult. One cannot
build a random interval from even random multiples
of a coarse interval. Synchronization at the scale
of the underlying time quantum could still cause
problems. Truly random sample-to-sample intervals
clearly require specialized OS support.
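

The random alternation between one and two quanta can be sketched as a one-shot timer that the alarm handler re-arms. This is an illustration under assumptions, not PCT's code: the 10 ms quantum, the small linear congruential generator (used because it makes no library calls inside a signal handler) and the names next_delay_usec() and arm_one_shot() are all hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

#define QUANTUM_USEC 10000L   /* assumed 10 ms timer quantum */

static unsigned long rng_state = 12345UL;

/* One pseudo-random bit from a linear congruential generator. */
static unsigned next_bit(void)
{
    rng_state = rng_state * 1103515245UL + 12345UL;
    return (unsigned)((rng_state >> 16) & 1UL);
}

/* Next delay: one or two quanta, chosen pseudo-randomly. */
long next_delay_usec(void)
{
    return QUANTUM_USEC * (1L + (long)next_bit());
}

/* Arm a one-shot virtual timer. it_interval stays zero, so the timer
 * does not repeat and the handler must call this again for the
 * following sample. Returns 0 on success, -1 on failure. */
int arm_one_shot(long delay_usec)
{
    struct itimerval it = { {0, 0}, {0, 0} };

    assert(delay_usec > 0);
    it.it_value.tv_sec = delay_usec / 1000000;
    it.it_value.tv_usec = delay_usec % 1000000;
    return setitimer(ITIMER_VIRTUAL, &it, NULL);
}
```

A SIGVTALRM handler would end with arm_one_shot(next_delay_usec()). As the text notes, both possible delays remain multiples of the same quantum, so this guards only against accidental synchronization.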


7 Related Work


Profiling is as old an art as writing programs. As 
with debugging, there has been a large amount of 
tool-building and research devoted to automating 
tasks that were originally done with hand-coded 
instrumentation. These prior profiling systems all 
take a more narrow view of profiling than the PCT 
philosophy of profiling as a type of debugging in 
which programmers apply the same familiar tools 
and preparations. 


Automatic instrumentation introduced the possibil- 
ity of collecting richer data than classic profil()- 
style samples, such as dynamic call graphs, ba- 
sic block activations, and control flow arc tran- 
sitions. Increasing complexity of software systems 
has driven a need for multi-process and even whole 
system profiling systems. Both static and dynamic 
profile-driven optimization have become a focus for
those interested in performance. 


Additionally, some have considered limited notions 
of higher-level profiling [22] such as the implemen- 
tation of abstract data types or other alternative al- 
gorithm selection. This approach instrumented pro- 
grams to record, for example, where data is typi- 
cally inserted into an ordered list. It used this data 
to decide between array or linked-list representa- 
tions. Inserts at the beginning favor linked repre- 
sentations while those at the end favor arrays. PCT 
might be leveraged to answer similar questions with- 
out modifying the program to use an instrumented 
data structure library. 


There have been a large number of compile-time au- 
tomatic instrumentation systems, all of which are 
invasive, early-bound, and non-extensible. An early 
hierarchical profiler was gprof.[12] The programs
tcov [6] and gcov [1] are similar to gprof, but 
instrument basic blocks instead of function calls. 
Many implementations of these types of profiler are 
not robust to improper program exits and are not 
tolerant of inadequate data in some objects. 


Many systems have also implemented some form of 
link-time instrumentation or post-link-time binary 
re-writing.[23, 14, 9, 15, 24, 20] These address
re-building issues somewhat and have some weak
extensibility. A significant invasion of foreign code may
remain, though. The code must be inserted to count 
executions, or, in more involved cases, log procedure 
arguments. 


Recently, a number of researchers have begun inves- 
tigating the limits of dynamic instrumentation — the 
re-writing of running executables. IBM has a system 
called DProbes [18, 10] which enables generic kernel-based
late-bound instrumentation. This is similar to
our debugger-controller based approach, but aims to 
build up a toolset for various machines and architec- 
tures rather than relying on the existing debugger 
infrastructure. Much lower overhead would likely be 
possible via this approach, but a great deal more 
work would need to be done to allow the sort of 
arbitrary expressions collectible with debuggers. 


Beyond instrumentation systems, there has been 
significant prior progress in PC-sampling-style pro- 
filing as well. There is the classic prof [4], and 
many latter-day counterparts. The most sophisti- 
cated system along these lines is probably the Dig- 
ital Continuous Profiling Infrastructure (DCPI).[8] 
DCPI focuses on understanding how the microarchitectural
features of Alpha processors play
out in full system applications. This system is un- 
fortunately proprietary and non-portable as many of 
its most impressive features rely upon CPU and OS 
support. The focus on low-level CPU behavior in- 
stead of high-level semantics of programs makes this 
system more useful for compiler writers and others
doing assembly-level optimization. For instance, it does
not support hierarchical call path samples along the
lines of [13].


SGI’s IRIX-specific SpeedShop [2] system is proba- 
bly the closest system in spirit to PCT, though it 
stops short of a full debugger-profiler. It does sampled
hierarchical profiling, has graceful report degradation,
and is late-binding. However, in addition to
being specific to IRIX on MIPS, it is restricted to 
dynamically linked executables, and fails to be ex- 
haustive in terms of kernel-resident and late-loaded 
code. Being proprietary, it is difficult to evaluate its 
extensibility. 


There have been many kernel-level profiling tools 
as well. Some system call tracers like strace -c 
support simple system call profiling.[7] Yaghmour’s 
Linux Trace Toolkit [25] is a useful kernel-level
event monitoring facility. PCT can leverage easily
accessible /proc/profile data on Linux. PCT’s
ptrace()-based techniques do not easily extend to
kernel code since the necessary process control features
are not typically available on the kernel itself.


The type of non-interactive debugging PCT does 
is similar to the Expect system for controlling 
interaction.[16] Our debugger-controller is similar, 
but with configuration syntax tailored to profiling 
and the ability to handle large subtrees of processes 
simultaneously. dbctl also does not rely on the relatively
slow logic and string processing of Tcl.[19]
As noted earlier, dbctl also allows non-interactive
process control other than state sampling.


8 Availability 


PCT is freely available under an open source license. 
More information and current software releases can 
be obtained at the PCT web page: 


http://pdos.lcs.mit.edu/~cblake/pct 


In the realm of simple profiling, every UNIX after AT&T
version 7 has support for profil() functionality, which
provides at least some capability. Kernel profile integration
is currently only available on Linux. The ldd command on
OpenBSD and FreeBSD is informative enough to allow tracking
shared library usage and process tree profiling.


Generalized profiling should be available on any system with
a good source-level debugger for the programming languages of
interest. Currently, gdb works well on the above-mentioned
systems as well as Solaris, HP-UX 9, 10 and 11, AIX, Irix,
SunOS, various other BSDs and probably many more platforms.
Ports of our interaction scripts to dbx and other debuggers
are under way.


9 Conclusion 


Each year software systems grow in complexity from 
multiple code regions per address space to multi- 
process programs. Correctness becomes harder to 
achieve, and conventional wisdom is to postpone 
performance analysis as long as possible. PCT re- 
quires no more preparation than for debugging. This 
allows programmers to interleave the optimization 
and debugging of their program however they see
fit. 


PCT unifies access to a number of existing profil- 
ing features that have been available for some time 
and extends profiling in new directions. The PCT 
debugger-based profiling architecture substantially 
extends the sort of data that automatic profiling 
can collect. One can sample programmer-definable, 
context-specific data. Such samples can often more
readily expose higher-level algorithmic issues, such 
as a mismatch between program structures and user 
inputs. The overhead of PCT scales reasonably with 
the complexity of the program data being sampled 
and with sampling rates. 


Finally, PCT is very portable by design, requiring no
special CPU or OS features or support. Informative data can
be gathered on most code in flexible ways. Reports can be
generated flexibly based on various data aggregations while
the program is still running. These features usually require
no recompiling, re-linking, or even re-starting of users’
programs.
Abstract


Compression and differencing techniques can greatly improve storage and transmission of files and file
versions. Since files are often transported across machines with distinct architectures and performance
characteristics, compressed data should be encoded in a form that is portable and efficient to decode. This
paper describes the Vcdiff encoding format for differencing and compression data and presents an empirical
study showing its effectiveness.


1 Introduction 


Data differencing computes a compact transformation to take
a source file to a target file based on their differences.
Data compression compresses data in a single file. The UNIX
utility diff is an example of a data differencing tool, while
compress and gzip are well-known data compressors.
Differenced and compressed data are good for storage and
transmission as they are often much smaller than the
originals. Differencing and compression techniques are
traditionally treated as distinct forms of data processing.
Our work on the Vdelta compressor [3, 4] showed that
compression and differencing can be treated uniformly by
unifying the Lempel-Ziv’77 string parsing scheme [14] and
Tichy’s block-move technique [12]. This unification is called
delta compression.

Compressed data need to be encoded in a portable and
efficient format so that they can be transported across a
network such as the Internet, which consists of diverse
hardware and software platforms. Many compressors are
available, each with its own way to represent data. However,
little has been published on the encoding formats used by
these compressors. A notable exception is gzip, whose
encoding format is published in the IETF Standard Deflate
[1]. Data differencing is much less developed than data
compression so there are few tools available. The diff
utility only works on text files and outputs editing commands
to be processed by the UNIX line editor ed. The only
published format for differencing of binary data is the W3C
Standard Gdiff [13]. Outside of this work, there is no
published encoding format for delta compression, i.e., a
format suitable for both compression and differencing. Given
the intended applications, we stipulate that such a data
encoding format should have the following attributes:


• Algorithm independence: The encoding format must be
independent from the algorithms used to compress data. This
allows a receiver to decode compressed data without having to
know how it was computed.


• Data portability: The encoding format must be free from
hardware architecture issues such as byte order and word
size. This allows a receiver to decode data without knowing
the architecture of the encoding machine.


• Output compactness: The encoding format must compactly
represent compressed and delta data. It should also be
transparently extensible by encoders to maximize compression
efficiency.


• Decoding efficiency: The encoding format must be decodable
on machines with limited computational power and memory. This
is important for web-based applications with small clients
such as PCs or hand-held devices.


The Vdelta software mentioned above was instrumental in the
work to extend HTTP/1.1 for Delta Encoding [9, 10]. However,
the encoding format used by Vdelta was not sufficiently
compact for compression data (i.e., when single files are
compressed) and not easily extensible. Since then, we have
designed a new format, Vcdiff, for delta compression. This
format incorporates a number of innovations that enable
compact data representation with extensibility to exploit
application-specific knowledge in gaining further
compression. This paper discusses the essential elements of
the Vcdiff encoding format and presents performance data
showing its effectiveness. A full description of the format
is given in a current IETF Proposed Standard [6].

2 Algorithm independence


Techniques such as Vdelta, Lempel-Ziv and block-move 
are based on string matching algorithms [4, 7, 8] to find 
matches either across files or in the same file. Each such 
match can then be compactly encoded by its location and 
length. Different string matching algorithms will typi- 
cally find different matches. 

String matching algorithms often stress memory resources so,
on current computers, they are not effective for processing
large files on the order of hundreds of megabytes or
gigabytes. To deal with this, a target file can be
partitioned into sufficiently small contiguous segments of
data called target windows, each of which is to be compressed
separately. To improve compression, such a target window may
be compared against some source window, a contiguous segment
of data from either the source file or the target file
itself. In the latter case, the source window is required to
come from some part of the target file preceding the current
target window so that, during decoding, the data for such a
window is well-defined. Finding the right matching source
window for a given target window is crucial for compressing
data. Algorithms to do this are called windowing algorithms.

String matching and windowing algorithms clearly af- 
fect compression effectiveness. However, from the point 
of view of designing an encoding format, it is prefer- 
able to abstract away the details of such algorithms. In 
this way, simple and generic decoders can be constructed 
without knowing how the data was encoded. An addi- 
tional benefit is that software vendors and/or researchers 
can continue improving the encoding algorithms without 
affecting the receivers of compressed data. We discuss 
how to design such an encoding format next. 


2.1 Windowing data 


Source and target windows may have different sizes but 
their sizes are chosen so that they can be processed en- 
tirely in memory. For data differencing, the traditional 
method simply aligns source and target windows by file 
offsets. For data compression, the popular rolling win- 
dow method uses a small data segment immediately be- 
fore the target window as the source window. These al- 
gorithms work well with small files since the window 
choices are limited (so they are mostly right by fiat) but 
they are suboptimal for large files as matching data may 
occur randomly and much further apart. In a work-in- 
progress, Vo explored a content-based method [11] to 
find source windows that would likely match well with 
given target windows (Section 5). Regardless of what 
window selection algorithm is used, a decoder does not 
need to know about it as long as the encoding format 
records the following data about source windows: 

• An indication of whether the source window is from the
source file or the target file,


• The starting position of the source window in the
respective file, and


• The length of the source window.


Given this basic data, a decoder can obtain the appro- 
priate source data to be used with the compressed data to 
decode a target window. Next we discuss what comprises 
compressed data. 


2.2 String matching and delta instructions 


When a target window $T$ of size $t$ is compressed given a
source window $S$ of size $s$, we shall think of $S$ and $T$
as substrings of a superstring $U$ formed by concatenating
them like this:


$S_0 S_1 \ldots S_{s-1} T_0 T_1 \ldots T_{t-1}$


The address of a byte in $S$ or $T$ is referred to by its
location in $U$. Thus, for any $k < t$, the address of $T_k$
is $s + k$. The compressed data consists of a sequence of
instructions called delta instructions. There are three
types:


• ADD: This instruction has two arguments, a size $\lambda$
and a sequence of $\lambda$ bytes to be copied.


• COPY: This instruction has two arguments, a size $\lambda$
and an address $a$ in the string $U$. These arguments specify
the substring of $U$ that must be copied into the target
window being constructed. For programming convenience, we
assert that such a substring must be entirely contained in
either $S$ or $T$.


• RUN: This instruction has two arguments, a size $\lambda$
and a byte that will be copied $\lambda$ times.


Let $\lambda(i)$ be the size of any delta instruction $i$ and
$a(i)$ the associated address if $i$ is a COPY. Let
$I = i_1 i_2 \ldots i_n$ be a sequence of delta instructions.
Then each instruction $i_k$ encodes a data segment
$\sigma(i_k)$ of size $\lambda(i_k)$. Let
$p = \sum_{1 \le m \le k-1} \lambda(i_m)$. We say that $I$ is
a faithful representation of $T$ if:


• For all $k$, $\sigma(i_k)$ is equal to the substring of $T$
starting at $p$ with size $\lambda(i_k)$;


• If $i_k$ is a COPY, then $a(i_k) < p + s$; and


• $\sum_{1 \le m \le n} \lambda(i_m)$ is equal to the size of
$T$.

1. Set $p = 0$ and $k = 1$.


2. If $i_k$ is a RUN instruction, copy the associated data
byte $\lambda(i_k)$ times to $T$ starting at $p$.


3. If $i_k$ is an ADD instruction, copy the associated data
to $T$ starting at $p$.


4. If $i_k$ is a COPY instruction,


(a) If $a(i_k) < s$, copy $\lambda(i_k)$ bytes from $S$
starting at $a(i_k)$ to $T$ starting at $p$;


(b) Else, copy $\lambda(i_k)$ bytes from $T$ starting at
$a(i_k) - s$ to $T$ starting at $p$.


5. Set $p = p + \lambda(i_k)$ and $k = k + 1$.

6. If $k \le n$, go to 2.


Figure 1: Decoding delta instructions


Let $S$ be a source window of size $s$, $T$ a target window
of size $t$ and $I$ a faithful representation of $T$.
Figure 1 shows the algorithm to reconstruct $T$ from $I$ and
$S$. A string copy operation is assumed to be carried out
from left to right so that Step 4.b is well-defined. Since
the total running time of the algorithm is proportional to
the number of bytes copied, the following result immediately
follows from the definition of a faithful representation:


Theorem 1 A target window encoded with a faithful se- 
quence of delta instructions can be decoded in O(t) time 
and space where t is the size of the window. 


S: abcdefghijklmnop
T: abcdwxyzefghefghefghefghzzzz


COPY 4, 0
ADD 4, wxyz
COPY 4, 4
COPY 12, 24
RUN 4, z


Figure 2: Delta instructions transforming S into T


Figure 2 shows example source and target windows and a
sequence of delta instructions encoding the target data. It
is easy to verify that this sequence is a faithful
representation of T. The first COPY instruction copies 4
bytes from address 0, i.e., the string abcd in the source
window S. Next is an ADD instruction that adds the 4
specified bytes wxyz. Note that the fourth instruction copies
data from T itself since address 24 is position 8 in T. This
instruction also shows that the data to be copied can overlap
with the data being copied from as long as the latter starts
earlier. This enables efficient encoding of periodic
sequences, i.e., sequences with regularly repeated
subsequences. The final RUN instruction compactly encodes the
last four bytes of T.
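
As an illustration, the decoding procedure of Figure 1 can be sketched in a few lines of Python and applied to the example of Figure 2. This is a minimal sketch, not the Vcodex/Vcdiff implementation: the function name decode_window and the tuple representation of instructions are assumptions made for this example, and the source window is written with all 16 bytes, as implied by address 24 being position 8 in T.

```python
def decode_window(source: bytes, instructions) -> bytes:
    """Rebuild a target window from delta instructions (cf. Figure 1).

    Each instruction is a tuple (hypothetical representation):
      ("ADD",  size, data_bytes)
      ("COPY", size, address_in_U)
      ("RUN",  size, single_byte)
    """
    s = len(source)
    target = bytearray()
    for kind, size, arg in instructions:
        if kind == "ADD":
            if len(arg) != size:
                raise ValueError("ADD size does not match its data")
            target += arg
        elif kind == "RUN":
            target += arg * size
        elif kind == "COPY":
            # A COPY must start before the current position s + p in U.
            if arg >= s + len(target):
                raise ValueError("COPY address is not yet defined")
            # Copy left to right, one byte at a time, so that a copy
            # overlapping the bytes being produced is well-defined.
            for a in range(arg, arg + size):
                target.append(source[a] if a < s else target[a - s])
        else:
            raise ValueError(f"unknown instruction {kind!r}")
    return bytes(target)


S = b"abcdefghijklmnop"
FIGURE_2 = [
    ("COPY", 4, 0),
    ("ADD", 4, b"wxyz"),
    ("COPY", 4, 4),
    ("COPY", 12, 24),   # address 24 = position 8 in T; overlaps its own output
    ("RUN", 4, b"z"),
]
T = decode_window(S, FIGURE_2)
```

Each byte of the target is written exactly once, which is the observation behind the linear bound of Theorem 1.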

Given a pair of target and source windows, there are usually
many different faithful representations of the target data.
For example, the target data in Figure 2 can also be
faithfully represented with a single ADD instruction that
includes all the data. From a compression point of view, it
is desirable to find the representation that requires the
fewest bytes to encode. Unfortunately, this problem is
NP-hard even when sizes and addresses are encoded with some
fixed number of bits (SR22 and SR23 in Garey and Johnson
[2]). This situation improves when relaxed to just finding
the minimum number of delta instructions without worrying
about whether or not their encoding minimizes the compressed
output. In this case, greedy approaches such as Lempel-Ziv
parsing or Tichy block-move [12] do compute the minimal
number of delta instructions in linear time and space given
appropriate string matching algorithms [7, 8]. The Vdelta
algorithm [4] relaxes this minimality to trade for faster
string matching and less working memory. In any case, the
point with delta instructions is that, no matter how they are
computed, Theorem 1 guarantees that a generic decoder can be
written that always runs in linear time and space.


3 Data portability 


The Vcdiff encoding format is byte-oriented. Each byte 
is limited to its lower eight bits for portability. The bits 
in a byte are ordered from right to left so that the least 
significant bit (LSB) has value 1, and the most significant
bit (MSB) has value 128.

Sizes and file offsets are unsigned integers encoded 
via a portable variable-sized format (originally intro- 
duced in the Sfio library [5]). This encoding treats an 
unsigned integer as a number in base 128. Then, each 
digit in this representation is encoded in the lower seven 
bits of a byte. Except for the least significant byte, other 
bytes have their most significant bit turned on to indicate 
that there are still more digits in the encoding. The two 
key properties of this integer encoding that are beneficial 
to a data compression format are: 


• The encoding is portable among systems using 8-bit
bytes, and


• Small values are encoded compactly.

Below is the encoding of the integer 123456789 in four 7-bit
digits whose values are 58, 111, 26, 21 in order from most to
least significant. In the 8-bit representation of these
digits, the MSBs of 58, 111 and 26 are on.


+----------+----------+----------+----------+
| 10111010 | 11101111 | 10011010 | 00010101 |
+----------+----------+----------+----------+
|  MSB+58  | MSB+111  |  MSB+26  |   0+21   |
+----------+----------+----------+----------+
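
The variable-sized integer format described above can be sketched as follows. This is a minimal illustration under our own naming (encode_uint and decode_uint are hypothetical helpers, not functions of the Sfio library or of the Vcodex/Vcdiff software).

```python
def encode_uint(n: int) -> bytes:
    """Encode an unsigned integer in base 128, most significant digit first.

    Every byte except the last (least significant) one has its MSB set to
    signal that more digits follow.
    """
    if n < 0:
        raise ValueError("only unsigned integers can be encoded")
    digits = [n & 0x7F]                    # least significant digit, MSB off
    n >>= 7
    while n:
        digits.append((n & 0x7F) | 0x80)   # higher digits, MSB on
        n >>= 7
    return bytes(reversed(digits))


def decode_uint(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one integer starting at buf[pos]; return (value, next position)."""
    value = 0
    while True:
        byte = buf[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:                # MSB off: this was the last digit
            return value, pos


encoded = encode_uint(123456789)
bits = [format(b, "08b") for b in encoded]
```

For 123456789 the list bits holds the four bit patterns shown above, and values below 128 occupy a single byte.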


4 Encoding delta instructions 


The delta instructions represent string matching results. 
In data differencing applications of text files, changes be- 
tween source and target data are often small, resulting 
in long common substrings. When that is the case, any 
straightforward representation of the delta instructions 
would be adequate. However, for differencing of binary 
files or general compression, matched substrings are of- 
ten short so that the delta instructions must be encoded 
well to achieve good compression rates. The key to com- 
pact encoding revolves around the questions of how to 
encode addresses of COPY instructions efficiently and 
how to deal with instructions having small sizes or lim- 
ited number of sizes. This leads to the ideas of address 
encoding modes and instruction code tables which are 
discussed next. 


4.1 Address caches and encoding modes 


Data in local regions are often replicated with minor 
changes. This is especially true in data differencing 
where target files are created from small changes in 
source files. Thus, the addresses of successive COPY 
instructions often occur close by or even exactly equal 
to one another. To take advantage of this phenomenon, 
Vcdiff maintains two types of address caches: 


• A near cache is an array with s_near slots of previously
matched addresses. An address p can be encoded against a
cached address q as p - q if p >= q.


• A same cache is an array with s_same * 256 slots of
previously matched addresses. If an address p is equal to
same[p%(s_same * 256)], then p can be encoded with the single
byte value p%256.


It is clear that an encoder and a decoder must be in sync
with respect to maintaining the address caches. The protocol
to enforce this is as follows:


1. Before processing (i.e., encoding or decoding) a target
window, all cache slots are initialized to zero.


2. After processing each COPY instruction, its address p is
used to update the caches as follows:

(a) The slots in the near cache are managed as a circular
buffer with a current index. The address p is first stored in
near[index], then index is incremented modulo s_near.


(b) The same cache is a hash table of size s_same * 256. The
address p is stored in same[p%(s_same * 256)].


In the above cache usage, the address encoding mode, i.e.,
the manner in which the address p of a COPY instruction is
encoded, must be recorded in the encoding data. Let here be
the current position in the target data (i.e., the start of
the data about to be encoded or decoded). Below are the
address modes:


• VCD_SELF: This mode has value 0 and indicates that p was
encoded as itself.


• VCD_HERE: This mode has value 1 and indicates that p was
encoded as here - p.


• Near: There are s_near modes in the range [2, s_near + 1].
If m is the mode of the address encoding then p was encoded
as p - near[m - 2].


• Same: There are s_same modes in the range
[s_near + 2, s_near + s_same + 1]. If m is the mode of the
encoding then p was encoded as a single byte b such that
same[(m - (s_near + 2)) * 256 + b] is equal to p.


By default, Vcdiff uses 4 for s_near and 3 for s_same,
resulting in a total of 9 different addressing modes.
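
The cache protocol and the address modes can be sketched together as below. This is an illustration under stated assumptions rather than the Vcodex/Vcdiff code: the class name AddressCache is hypothetical, here is assumed to be expressed in the same address space as p (so that here - p is non-negative), and the encoder's policy of picking the smallest value while preferring a same-cache hit is our own choice, since the text above leaves the selection policy to the encoder.

```python
VCD_SELF, VCD_HERE = 0, 1


class AddressCache:
    """Near and same caches, updated identically by encoder and decoder."""

    def __init__(self, s_near: int = 4, s_same: int = 3) -> None:
        self.s_near, self.s_same = s_near, s_same
        self.reset()

    def reset(self) -> None:
        # Step 1: all slots are zero before each target window.
        self.near = [0] * self.s_near
        self.same = [0] * (self.s_same * 256)
        self.index = 0

    def update(self, p: int) -> None:
        # Step 2: record the address of the COPY just processed.
        if self.s_near:
            self.near[self.index] = p
            self.index = (self.index + 1) % self.s_near
        if self.s_same:
            self.same[p % (self.s_same * 256)] = p

    def encode(self, p: int, here: int) -> tuple[int, int]:
        """Return (mode, value); the selection policy is an assumption."""
        candidates = [(p, VCD_SELF), (here - p, VCD_HERE)]
        for slot, q in enumerate(self.near):
            if p >= q:
                candidates.append((p - q, 2 + slot))
        value, mode = min(candidates)          # smallest value, lowest mode
        if self.s_same:
            slot = p % (self.s_same * 256)
            if self.same[slot] == p:           # exact hit: a single byte
                mode, value = self.s_near + 2 + slot // 256, slot % 256
        self.update(p)
        return mode, value

    def decode(self, mode: int, value: int, here: int) -> int:
        if mode == VCD_SELF:
            p = value
        elif mode == VCD_HERE:
            p = here - value
        elif mode < self.s_near + 2:
            p = self.near[mode - 2] + value
        else:
            p = self.same[(mode - (self.s_near + 2)) * 256 + value]
        self.update(p)
        return p


enc = AddressCache()
first = enc.encode(1000, 5000)    # caches are empty: encoded as itself
second = enc.encode(1000, 6000)   # same address again: same-cache hit
third = enc.encode(1010, 7000)    # close to a cached address: near cache
```

Because encode and decode both call update after every address, a decoder that starts from reset caches reproduces the encoder's cache state step by step.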


4.2 Instruction code tables 


Successive delta instructions often represent short matches
separated by small amounts of unmatched data. So the sizes of
the COPY and ADD instructions are often small. This is
particularly true of binary data such as executable files or
semi-structured data such as HTML or XML. In such cases, it
is beneficial to combine sizes, instruction types and even
successive pairs of instructions. The effectiveness of such
combinations depends on many factors including the data being
processed and the string matching algorithm in use. For
example, in a case where many COPY instructions with the same
data sizes are generated, it may be worth encoding these
instructions more compactly than others.

To maintain independence from the choices made in encoding
algorithms, we introduce the notion of instruction code
tables, each of which consists of 256 entries. These entries
describe combinations of sizes, instruction types and pairs
of instructions. The encoder and decoder(s) of a compressed
dataset must share the same table. Then, the encoding only
records the indices of the table entries, each of which fits
in a single byte.

As depicted below, an entry in an instruction code table
conceptually consists of two triples, each of the form
(inst, size, mode):


+-------+-------+-------+-------+-------+-------+
| inst1 | size1 | mode1 | inst2 | size2 | mode2 |
+-------+-------+-------+-------+-------+-------+


• inst: This field can be one of NOOP, ADD, RUN or COPY to
indicate the instruction types. NOOP means that no
instruction is specified.


• size: This field is either zero or positive. Zero means
that the size associated with the instruction is encoded
separately in the encoding data. A positive value defines the
actual data size so the encoding data will omit it.


• mode: This field is significant only when the instruction
type is COPY. It defines the encoding mode used to encode the
associated address.


Thus, each entry in the instruction code table can en- 
code a single instruction (one of the triples is a NOOP) 
or two successive instructions. Vcdiff itself defines a de- 
fault instruction code table for the case when the near 
cache has 4 slots and the same cache has 3 * 256 slots. 
Thus, there are 9 address modes for COPY instructions. 
The first two are VCD_SELF(0) and VCD_HERE(1).
Modes 2, 3, 4 and 5 are for addresses coded against the 
near cache. And, modes 6, 7 and 8 are for addresses 
coded against the same cache. This default table is as- 
sumed to be available with each encoder and decoder. 
The Vcdiff encoding format also allows an encoder to 
define its own custom code table but then it has to en- 
code this table in the data itself [6]. 

Table 1 depicts the default instruction code table. Each
numbered line represents one or more entries (recall that an
entry in the instruction code table may represent up to two
combined delta instructions). The last column (“Index”) shows
the index value or range of index values of the entries
covered by that line. The first 6 columns of a line describe
the pairs of instructions used for the corresponding index
value(s). For example, line 1 shows the single RUN
instruction with index 0. As the size field is 0, this RUN
instruction always has its actual size encoded separately in
the encoding data. Line 2 shows the 18 single ADD
instructions. The ADD instruction with size field 0 (i.e.,
the actual size is coded separately) has index 1. ADD
instructions with sizes from 1 to 17 use code indices 2 to 18
and their sizes are as given (so they will not be separately
encoded). Lines 12 to 21 show the pairs of instructions that
are combined together. For example, line 12 depicts the 12
entries in which an ADD instruction is combined with an
immediately following COPY instruction. The entries with
indices 163, 164, 165 represent the pairs in which the ADD
instructions all have size 1 while the COPY instructions have
mode VCD_SELF(0) and sizes 4, 5 and 6 respectively.
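
To make the layout of the default table concrete, the sketch below generates its 256 entries in index order. It is an illustration only: the function name default_code_table and the tuple form of the triples are assumptions, and the ordering inside each line (sizes ascending, with the COPY size varying fastest in combined entries) is inferred from the index values quoted in the text rather than taken from a normative listing.

```python
NOOP, ADD, RUN, COPY = "NOOP", "ADD", "RUN", "COPY"


def default_code_table():
    """Return the 256 entries as pairs of (inst, size, mode) triples."""
    none = (NOOP, 0, 0)
    table = [((RUN, 0, 0), none)]                         # line 1: index 0
    # Line 2: ADD with size 0 (coded separately), then sizes 1..17.
    table += [((ADD, size, 0), none) for size in range(0, 18)]
    # Lines 3-11: COPY with size 0, then sizes 4..18, for each of 9 modes.
    for mode in range(9):
        table += [((COPY, size, mode), none)
                  for size in [0, *range(4, 19)]]
    # Lines 12-17: ADD sizes 1..4 combined with COPY sizes 4..6, modes 0..5.
    for mode in range(6):
        table += [((ADD, a, 0), (COPY, c, mode))
                  for a in range(1, 5) for c in range(4, 7)]
    # Lines 18-20: ADD sizes 1..4 combined with COPY size 4, modes 6..8.
    for mode in range(6, 9):
        table += [((ADD, a, 0), (COPY, 4, mode)) for a in range(1, 5)]
    # Line 21: COPY size 4 in any mode followed by ADD size 1.
    table += [((COPY, 4, mode), (ADD, 1, 0)) for mode in range(9)]
    if len(table) != 256:
        raise AssertionError("a code table must have exactly 256 entries")
    return table


table = default_code_table()
entries_163_to_165 = table[163:166]
```

Looking up table[163], table[164] and table[165] yields ADD of size 1 paired with COPY of sizes 4, 5 and 6 in mode 0, matching the description above.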

Table 2 shows two different encodings of the delta
instructions from Figure 2: Plain and Optimized. In the Plain
encoding, each instruction was encoded by itself with its
size coded separately. For example, the first COPY
instruction used code index 19 so its size and address were
separately encoded, entailing a total cost of three bytes.
Similarly, the ADD instruction used index 1 with separately
encoded size so the cost was 6 bytes. Altogether, the Plain
coding used 18 bytes, which substantially improved over the
original data size of 28 but was not optimal.

In the Optimized encoding, the first COPY instruction used
code index 20 with implicit size 4. Thus, its encoding took
only two bytes. The second and third instructions, ADD and
COPY, were combined via code index 172 with both sizes
implicitly defined. Thus, both instructions were encoded in 6
bytes instead of the original 9 bytes in the Plain encoding.
Altogether, the size of the Optimized encoding improved to 13
bytes.

The Optimized encoding shows that judicious use of 
instructions with implicitly defined sizes and combined 
instructions can substantially improve the compression 
rate. We discuss next how to compute the optimal en- 
coding given a fixed code table. 


4.3 Optimizing instruction encoding 


Section 4.2 showed that an encoder has wide latitude in
choosing when and how to combine and encode delta
instructions. In fact, for any fixed instruction code table,
one can optimize the encoding of a sequence of delta
instructions using dynamic programming. Toward this end, let
$I = i_1 i_2 \ldots i_n$ be a sequence of delta instructions.
We shall use $I_k$ to denote the subsequence of $I$ starting
from the $k$-th instruction and extending to the end of $I$.
For example, $I = I_1$. We define $I_k$ to be the empty
sequence whenever $k > n$.

Table 1: The default instruction code table

      Type   Size        Mode    Type   Size    Mode    Index
  1.  RUN    0           0       NOOP   0       0       0
  2.  ADD    0, [1,17]   0       NOOP   0       0       [1,18]
  3.  COPY   0, [4,18]   0       NOOP   0       0       [19,34]
  4.  COPY   0, [4,18]   1       NOOP   0       0       [35,50]
  5.  COPY   0, [4,18]   2       NOOP   0       0       [51,66]
  6.  COPY   0, [4,18]   3       NOOP   0       0       [67,82]
  7.  COPY   0, [4,18]   4       NOOP   0       0       [83,98]
  8.  COPY   0, [4,18]   5       NOOP   0       0       [99,114]
  9.  COPY   0, [4,18]   6       NOOP   0       0       [115,130]
 10.  COPY   0, [4,18]   7       NOOP   0       0       [131,146]
 11.  COPY   0, [4,18]   8       NOOP   0       0       [147,162]
 12.  ADD    [1,4]       0       COPY   [4,6]   0       [163,174]
 13.  ADD    [1,4]       0       COPY   [4,6]   1       [175,186]
 14.  ADD    [1,4]       0       COPY   [4,6]   2       [187,198]
 15.  ADD    [1,4]       0       COPY   [4,6]   3       [199,210]
 16.  ADD    [1,4]       0       COPY   [4,6]   4       [211,222]
 17.  ADD    [1,4]       0       COPY   [4,6]   5       [223,234]
 18.  ADD    [1,4]       0       COPY   4       6       [235,238]
 19.  ADD    [1,4]       0       COPY   4       7       [239,242]
 20.  ADD    [1,4]       0       COPY   4       8       [243,246]
 21.  COPY   4           [0,8]   ADD    1       0       [247,255]


Table 2: Encoding the delta instructions in Figure 2


The code entries in an instruction code table assume that
the addresses of COPY instructions and the data of ADD and
RUN instructions are always coded separately. Thus, to
optimize the encoding, we only need to consider the sizes of
the instructions and their types, i.e., ADD, RUN, COPY and
any addressing modes. Now, for each instruction $i$, let the
cost of $i$, $c(i)$, be the number of bytes required to
encode $i$ and its size using the best choice from the
instruction code table. We assume that the instruction code
table has been defined so that there is at least one way of
doing this. Likewise, for any two consecutive instructions
$i$ and $j$, let the cost $c(i, j)$ be the number of bytes
required using the best table entry that combines both
instructions. If there is no way to combine $i$ and $j$, we
let $c(i, j)$ be infinite. Finally, let $C(I)$ be the cost of
encoding the sequence $I$. Then, $C(I)$ can be obtained by
solving the following dynamic program:


$$
C(I) =
\begin{cases}
0 & \text{if } I = \emptyset; \\
c(i_1) & \text{if } |I| = 1; \\
\min\{\, c(i_1) + C(I_2),\; c(i_1, i_2) + C(I_3) \,\} & \text{otherwise.}
\end{cases}
$$


The first case states that the cost of encoding an empty
sequence is 0. The second case states the cost of encoding a
sequence with a single instruction. The last case computes
the optimal cost by minimizing between two alternatives:
encoding the first instruction by itself or combining the
first two instructions. In each alternative, recursion is
used to deal with the rest of the sequence.

Since each $C(I_k)$ is uniquely determined by the sequence
$I_k$, we can keep track of all processed subsequences in
$O(|I|)$ space so that the recursion can be pruned whenever
it arrives at a processed subsequence. We have shown:


Theorem 2 Given a fixed instruction code table, any sequence
of delta instructions $I$ can be encoded optimally in
$O(|I|)$ time and space.
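
The dynamic program behind Theorem 2 can be sketched as a single backward pass over the instruction sequence. This is a minimal sketch, not the Vcodex/Vcdiff implementation: the helper names are hypothetical, the two cost functions hard-code the default code table of Table 1 instead of searching an arbitrary table, instructions are (type, size, mode) triples whose addressing modes are assumed to be already chosen, and a separately coded size is assumed to use the variable-sized integer format of Section 3.

```python
import math


def uint_len(n: int) -> int:
    """Number of bytes of n in the base-128 integer format."""
    length = 1
    while n >= 128:
        n >>= 7
        length += 1
    return length


def single_cost(inst) -> int:
    """c(i): code byte plus size bytes, using the default table's entries."""
    kind, size, _mode = inst
    implicit = (kind == "ADD" and 1 <= size <= 17) or \
               (kind == "COPY" and 4 <= size <= 18)
    return 1 if implicit else 1 + uint_len(size)


def pair_cost(first, second) -> float:
    """c(i, j): 1 if the default table combines the pair, else infinity."""
    (k1, s1, _m1), (k2, s2, m2) = first, second
    if k1 == "ADD" and k2 == "COPY" and 1 <= s1 <= 4:
        if (0 <= m2 <= 5 and 4 <= s2 <= 6) or (6 <= m2 <= 8 and s2 == 4):
            return 1
    if k1 == "COPY" and k2 == "ADD" and s1 == 4 and s2 == 1:
        return 1
    return math.inf


def optimal_cost(instructions) -> int:
    """C(I): minimal bytes for instruction codes and sizes, in O(|I|)."""
    n = len(instructions)
    cost = [0] * (n + 2)          # cost[k] = C of the suffix starting at k
    for k in range(n - 1, -1, -1):
        best = single_cost(instructions[k]) + cost[k + 1]
        if k + 1 < n:
            paired = pair_cost(instructions[k], instructions[k + 1])
            best = min(best, paired + cost[k + 2])
        cost[k] = best
    return cost[0]


FIGURE_2 = [("COPY", 4, 0), ("ADD", 4, 0), ("COPY", 4, 0),
            ("COPY", 12, 0), ("RUN", 4, 0)]
code_bytes = optimal_cost(FIGURE_2)
payload_bytes = 3 * 1 + 4 + 1     # three one-byte addresses, ADD data, RUN byte
```

For the instructions of Figure 2 the pass finds 5 bytes of codes and sizes; adding the 8 bytes of addresses and data, which are coded separately, gives the 13 bytes of the Optimized encoding.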


In addition to optimizing delta instruction encoding 
given a fixed code table, it is also possible for an ap- 
plication to define its own code tables inside the encod- 
ing data [6]. This enables an application to gain further 
compression by specially treating certain instructions or 
pairs of instructions that are much more popular than others.
We are investigating the question of how to compute
such an optimal instruction code table. The mentioned 
IETF Proposed Standard [6] also discusses the use of 
secondary compressors to further compress the instruc- 
tion encoding. 


5 Performance 


We show the effectiveness of the Vcdiff encoding format in
two ways. The first set of experiments was based on three
different source code archives of the GNU C compiler,
gcc-2.95.1.tar, gcc-2.95.2.tar and gcc-2.95.3.tar. We used
the Vcodex/Vcdiff software (Section 8) to compare various
options of Vcdiff against the gzip and compress tools. These
files were very large so that some windowing scheme had to be
used. In the second set of experiments, we collected the home
page of www.cnn.com every hour for 10 days and computed the
deltas using various methods.


5.1 Comparing with compress and gzip 


We compared Vcdiff against compress and gzip using the three
source code archives of the GNU C compiler mentioned above.
The experiments were done on an SGI-MIPS3, 400 MHz. Timing
results were obtained by running each program three times and
taking the average of the total cpu+system times. Below are
the different Vcdiff runs:


• Vcdiff-c: Vcdiff was used for compression only. That is, no
source file was used. This directly compared Vcdiff, gzip and
compress as compressors.


• Vcdiff-d: Vcdiff was used for differencing only. That is,
matching was allowed only between source and target data.
Windows were simply matched by positions across target and
source files.


• Vcdiff-dc: This is similar to Vcdiff-d but matching within
target data was also allowed, i.e., delta compression was
used.


• Vcdiff-dcw: This is similar to Vcdiff-dc but a
content-based windowing algorithm [11] was used to select
source windows more likely to match with given target
windows. Thus, file offsets of source and target windows
would seldom align.


Table 3 shows the experimental results. Note that compression
times were typically dominated by the string matching and
encoding algorithms. For example, the large time variation in
the Vcdiff rows was strictly due to the windowing and string
matching algorithms used in the Vcodex/Vcdiff software. Such
measurements were somewhat irrelevant from the point of view
of evaluating an algorithm-independent encoding format.
However, the decompression times were indicative of how the
different formats would perform in practice.

The pure compressor Vcdiff-c gave a worse compression rate
than gzip but a better one than compress. However, it always
decompressed fastest. Version gcc-2.95.2 was similar to
version gcc-2.95.1. Thus, compressing gcc-2.95.2 given
gcc-2.95.1 gained up to a factor of 500 in size reduction as
shown in the last three rows. On the other hand, the files in
the archive gcc-2.95.3 were sufficiently changed and
rearranged from gcc-2.95.2 that simply matching source and
target windows by file positions was ineffective. As a
result, Vcdiff-d and Vcdiff-dc did not perform well even
though delta compression did help Vcdiff-dc to beat Vcdiff-c
and come close to gzip. Vcdiff-dcw still worked well due to
the content-based windowing algorithm. There was a clear time
cost for using such an algorithm during encoding but decoding
time was not affected.

Finally, the cat row of the table shows the times required to
just copy the files gcc-2.95.2.tar and gcc-2.95.3.tar,
respectively 1.08 and 1.05 seconds. Thus, in the best case of
Vcdiff-dcw, decompression times were only about 70-90% worse
than plain copying of the data. The dramatic size reduction
meant that, with an appropriate encoder, the Vcdiff encoding
format presented a good mechanism for transporting data
without taxing client machines on decoding.


5.2 Compressing a set of web pages 


We collected the home pages of www.cnn.com every 
hour starting at 12:00AM on 10/23/2001 and ending at 
11:00PM on 11/01/2001. The following methods for delta compression
were used:


• diff+gzip: This method runs the diff -e program to
compute the differences, then pipes the result to 
gzip for further compression. 


• gdiff: We instrumented the Vcodex/Vcdiff software
to run the Vcdiff string matching algorithm but output
results in the Gdiff format.


• gdiff+gzip: This is like the above but the result is
piped to gzip for further compression. 


• vcdiff: This uses the Vcdiff encoding format.


We ran two different experiments. In the First exper- 
iment, each file is compressed against the first file col- 
lected while, in the Successive experiment, each file is 
compressed against the one in the previous hour. Table 4 
summarizes the compression results. The raw row shows
the Minimum, Maximum and Average sizes of the collected
files. Later rows show the same statistics for the
compressed data. Delta compression was effective in reducing
the data sizes overall. The best method, vcdiff,
reduced data by a factor of about 15 in the First experiment
and about 120 in the Successive experiment.


Table 3: Comparing Vcdiff to gzip and compress using the gcc-2.95.[123] archives

            |        2.95.2 (55,797,760)          |        2.95.3 (55,787,520)
Compressor  | Size       | Comp.(s) | Decomp.(s)  | Size       | Comp.(s) | Decomp.(s)
cat         | 55,797,760 | 1.08     | 1.08        | 55,787,520 | 1.05     | 1.05
compress    | 19,939,390 | 13.85    | 7.09        | 19,939,453 |          |
gzip        | 12,973,443 | 42.99    | 5.35        | 12,998,097 |          |
Vcdiff-c    |            | 20.09    |             |            |          |

            |  2.95.2 given 2.95.1 (55,746,560)   |  2.95.3 given 2.95.2 (55,797,760)
Vcdiff-d    | 100,971    | 10.93    |             | 26,383,849 | 71.41    | 6.41
Vcdiff-dc   | 97,246     | 20.03    |             | 14,461,203 |          | 4.82
Vcdiff-dcw  | 256,445    | 44.81    | 1.84        | 1,248,543  | 61.18    | 1.99


Table 4: Delta compression of www.cnn.com

        |          First           |        Successive
        | Min.   | Max.   | Avg.   | Min.   | Max.   | Avg.
raw     | 44,602 | 50,033 | 46,036 | 44,602 | 50,033 | 46,036
gdiff   | 11     | 5,597  | 4,277  | 11     |        | 458
vcdiff  |        | 4,209  |        |        |        | 385


Figure 3 shows in detail the sizes of the compressed
data in the First experiment. The order of the methods
from worst to best was diff+gzip, gdiff, gdiff+gzip and
vcdiff. Since gdiff and vcdiff were based on the same underlying
algorithms to compute delta instructions, the results
compared directly the effectiveness of the different
encoding formats. Even with the additional compression
step using gzip (i.e., the Deflate format) the gdiff+gzip
results were still slightly worse than vcdiff. The fact that
delta compression was still effective after a fairly long
duration of 10 days suggested that these pages were generated
from some large template that seldom changed.


Figure 4 shows results from the Successive experiment.
The diff+gzip data fluctuated wildly because diff
was line-oriented and could not handle small changes
made on many lines. Due to this large fluctuation, the
format used in Figure 3 did not show the data well.
Therefore, we plotted all data points relative to the ones
from vcdiff by simply dividing each data point by the corresponding
value from vcdiff. Thus, the flat horizontal
line at 1 represented vcdiff data. The encoding formats of
gzip, gdiff and vcdiff required fixed overhead sets of bytes
(shown in the Min. columns in Table 4 as the compared
files were identical in that case). We subtracted such
overheads from the data points before dividing to remove
the large distortion in the ratios when files changed little.
Vcdiff was again the best encoding format for delta compression
in this experiment. In fact, Table 4 showed that
it typically reduced data by more than 2 orders of magnitude
since files changed very little in successive hours.
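
The normalization described for Figure 4 can be sketched as follows. This is only an illustration in Python, not the code used to produce the figure; the function name relative_size and the sample byte counts are hypothetical.

```python
def relative_size(size, overhead, vcdiff_size, vcdiff_overhead):
    """Return a method's delta size relative to vcdiff after removing fixed overheads.

    Returns None when the vcdiff payload is empty (identical files), because
    the ratio is undefined in that case.
    """
    payload = size - overhead
    vcdiff_payload = vcdiff_size - vcdiff_overhead
    if vcdiff_payload <= 0:
        return None
    return payload / vcdiff_payload


# Hypothetical sizes in bytes for one hourly sample (not measured data).
ratio = relative_size(size=611, overhead=11, vcdiff_size=310, vcdiff_overhead=10)
print(ratio)  # 2.0
```

A ratio above 1 means the method produced a larger delta than vcdiff for that hour, once the fixed header bytes are discounted.
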

Timing results were not shown in the above experiments
but diff+gzip and gdiff+gzip were much slower
than vcdiff and gdiff because of the use of multiple processes
and, in the case of diff+gzip, slow text alignment
algorithms. For vcdiff and gdiff, it was also hard to obtain
meaningful measurements since the files were small
and the encoding algorithms were sufficiently fast so that
process start-up time became the dominating factor.


6 Summary 


We described Vcdiff, a general and portable encoding
format for delta compression, i.e., combined compression
and differencing. This is the first fully described
encoding format for this type of data processing. Vcdiff
introduced the novel idea of an instruction code table
to allow combining delta instructions to optimize
compression rate. We showed how to compute minimal
encodings given a fixed code table using dynamic programming.
More importantly, the nature of the encoding
format enables construction of decoders free from any
knowledge of encoders and guaranteed to run in linear
time and space. Thus, Vcdiff is suitable for web-based
client-server applications in which a big server can send
data to much smaller clients with different hardware architectures.
We presented performance results showing
that Vcdiff compares favorably to other formats for data
differencing including the W3C Gdiff Standard and the
use of diff and gzip. Vcdiff is the subject of a current
IETF Proposed Standard [6].


8 Code availability


The Vcdiff data format described here is free from
any patent claims. An implementation of Vcdiff
is available as a part of the Vcodex package written
by Phong Vo. The code can be obtained from
http://www.research.att.com/sw/tools.


A Precise and Efficient Evaluation of the Proximity between Web Clients
and their Local DNS Servers

Oliver Spatscheck, and Jia Wang
AT&T Labs—Research 


Abstract 


Content Distribution Networks (CDNs) attempt to improve
Web performance by delivering Web content to
end-users from servers located at the edge of the net- 
work. An important factor contributing to the perfor- 
mance improvement is the ability of a CDN to select 
servers in the proximity of the requesting clients. Most 
CDNs today use the Domain Name System (DNS) to 
make such server selection decisions. However, DNS 
provides only the IP address of the client’s local DNS 
server to the CDN, rather than the client’s IP address. 
Therefore, CDNs using DNS-based server selection as- 
sume that clients are “close” to their local DNS servers. 


To quantify the proximity between clients and their local 
DNS servers, we propose a novel, precise, and efficient 
technique for finding the associations of client to local 
DNS servers. We collected more than 4.2 million such 
unique associations in three months. From this data, we 
study the impact of proximity on DNS-based server se- 
lection using four different proximity metrics. We con- 
clude that DNS is good for very coarse-grained server 
selection, since 64% of the associations belong to the 
same Autonomous System. DNS is less useful for finer- 
grained server selection, since only 16% of the client and 
local DNS associations are in the same network-aware 
cluster [13] (based on BGP routing information from a 
wide set of routers). As an application of this method- 
ology, we evaluate DNS-based server selection in three 
of the largest commercially deployed CDNs to study its 
accuracy. 


1 Introduction 


Creating and managing a high-performance, Internet-scale
Web service is a formidable challenge involving
deployment of multiple Web servers in strategic locations
throughout the network. The introduction of Content
Distribution Networks (CDNs) has allowed organizations
to overcome this challenge by outsourcing the
distribution of their Web content. With CDNs, content
providers need only supply an origin Web server;
the CDN distributes the content to end users through a
set of CDN servers it has deployed in the network. Ideally,
this reduces Web response time and download latencies
in addition to providing overload protection and
bandwidth savings.


* Zhuoqing Morley Mao (email: zmao@cs.berkeley.edu) is a Computer
Science graduate student at University of California, Berkeley.
This work was done during her internship at AT&T Research Labs.

* Current affiliation: IBM Research


In a well-designed CDN, servers are placed to avoid congested
links and slow network paths. When a Web client
requests content, the CDN dynamically chooses a server 
to route the request to, usually one that is appropriately 
close to the client. Note that this dynamic CDN re- 
quest routing is an extra step that is not necessary for 
stand-alone Web servers. Efficient CDN server selec- 
tion allows CDNs to overcome the extra overhead of the 
dynamic routing step by taking advantage of improved 
connectivity to the end user. CDN server selection ap- 
plies for both static and dynamic content. In the latter 
case, content can be dynamically assembled at the edge 
servers [1]. 


CDNs typically perform dynamic request routing using
the Internet’s Domain Name System (DNS) [11]. The
DNS is a distributed directory whose primary role is to
map fully qualified domain names (FQDNs) to IP addresses.
To determine an FQDN’s address, a DNS client
sends a request to its local DNS server. The local DNS
server resolves the request on behalf of the client by
querying a set of authoritative DNS servers. When the
local DNS server receives an answer to its request, it
sends the result to the DNS client and caches it for future
queries. Each DNS record has a time-to-live (TTL) field
that tells the local DNS server how long it may cache the
result.
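
The caching behaviour of a local DNS server can be illustrated with a small sketch. It is a simplified model in Python, not an implementation of the DNS protocol; the names TtlCache, resolve and query_authoritative are hypothetical, and the authoritative lookup is passed in as a plain function.

```python
import time


class TtlCache:
    """Cache of name -> (address, expiry time), as a local DNS server might keep."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}

    def put(self, name, address, ttl_seconds):
        self._entries[name] = (address, self._clock() + ttl_seconds)

    def get(self, name):
        entry = self._entries.get(name)
        if entry is None:
            return None
        address, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[name]  # record expired; it must be resolved again
            return None
        return address


def resolve(name, cache, query_authoritative):
    """Answer from the cache while the TTL is valid, else ask the authoritative server."""
    cached = cache.get(name)
    if cached is not None:
        return cached
    address, ttl_seconds = query_authoritative(name)
    cache.put(name, address, ttl_seconds)
    return address
```

With a long TTL most client requests are answered from the cache; with a short TTL the authoritative server is consulted more often and can change its answer sooner.
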


Normally, an authoritative DNS server’s association
from FQDNs to IP addresses is static. However, CDNs
use modified authoritative DNS servers for CDN server
selection. The results of a DNS query to one of these
DNS servers may vary dynamically depending on factors
such as the source of the request and the condition
of the network. Typically, the CDN’s authoritative DNS
server maps the client’s local DNS server address to a
geographic region within a particular network and combines
that with network and server load information to
perform CDN server selection. To enable fast reaction
to dynamic resource changes, the answer returned by the
CDN’s DNS server has a small TTL. This approach is
largely transparent to the client, and works for any Web
content (including both HTML and streaming media).
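
As a rough illustration of this kind of selection, the sketch below maps the address of the querying local DNS server to a region and returns the least-loaded server there with a short TTL. It does not describe any particular CDN; the prefix table, server loads, the 20-second TTL and the function name select_server are all assumptions made for the example.

```python
import ipaddress

# Hypothetical data: LDNS prefixes mapped to regions, and per-region server loads.
REGION_BY_PREFIX = {
    ipaddress.ip_network("192.0.2.0/24"): "east",
    ipaddress.ip_network("198.51.100.0/24"): "west",
}
SERVERS = {
    "east": {"203.0.113.1": 0.80, "203.0.113.2": 0.35},
    "west": {"203.0.113.9": 0.50},
}
SHORT_TTL_SECONDS = 20  # assumed value; the text only says the TTL is small


def select_server(ldns_ip, default_region="east"):
    """Pick the least-loaded server in the region inferred from the LDNS address."""
    addr = ipaddress.ip_address(ldns_ip)
    region = default_region
    for prefix, name in REGION_BY_PREFIX.items():
        if addr in prefix:
            region = name
            break
    servers = SERVERS[region]
    best = min(servers, key=servers.get)
    return best, SHORT_TTL_SECONDS
```

Note that the only input is the local DNS server's address; the client's own address never reaches the function, which is the limitation examined in this paper.
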


Although DNS-based server selection is transparent and 
general, it has two inherent limitations [15, 4]. First, it 
is based on the implicit assumption that clients are close 
to their local DNS servers. The CDN DNS server per- 
forming dynamic request routing only has access to the 
client’s local DNS server’s IP address—it does not know 
the client’s own IP address. However, the assumption 
that clients are close to their local DNS server may not 
be valid. For example, the client might be using a lo- 
cal DNS server hierarchy in which the outermost local 
DNS server that communicates with authoritative DNS 
servers may be far removed from clients; the client may 
have been configured with a local DNS server which is 
far away; or the client may be using a secondary local 
DNS server that is more distant from it than its primary 
local DNS server. Therefore, using only the local DNS 
server information to select CDN servers has the inher- 
ent risk of selecting a server farther away from the client 
than other available CDN servers. 


The second inherent limitation of DNS-based server se- 
lection is that a single request from a local DNS server 
can represent differing numbers of Web clients — this 
is called the hidden load factor [8]. The hidden load 
has implications on a CDN’s load balancing algorithm. 
For example, a DNS request from a local DNS server 
of a large ISP may result in many more Web requests 
than a DNS request from a local DNS server of a small 
site. CDNs need to be able to properly weigh individual 
DNS requests to distribute Web requests among their CDN
servers. If the hidden load factors are known, load bal- 
ancing algorithms described by Colajanni, et al. [7, 8] 
can be easily deployed to achieve better load distribu- 
tion. On the other hand, if the hidden load factors are 
not known, fine-grained request distribution may be dif- 
ficult. 


We study the extent of the first limitation and its impact
on CDN server selection. To this end, we developed a
simple, non-intrusive, and efficient mapping technique
to determine the associations between clients and local
DNS servers. We deployed this technique on several
sites to collect an extensive data set which we use to
study the impact of proximity on DNS-based server selection
using four different proximity metrics. We conclude
that DNS is good for very coarse-grained server
selection, since 64% of the associations belong to the
same Autonomous System (AS). DNS is less useful for
finer-grained server selection, since only 16% of clients
use DNS servers in the same network-aware cluster [13]
(based on BGP routing information). We also measure
the CDN server distribution of several real-world CDNs
to evaluate whether the proximity of a client to its local
DNS server leads to potentially suboptimal CDN server
selection decisions in practice. Our technique could also
be used to determine hidden load factors by associating
the HTTP request pattern in the Web server logs with the
DNS request information.


Our work makes the following contributions. We devel- 
oped a novel measurement methodology and architec- 
ture for accurately collecting local DNS server IP ad- 
dresses of Web clients. We demonstrated its successful 
deployment on several sites including a large commer- 
cial site and through the collection of a huge database 
of associations. Based on this data, we did an extensive 
analysis of the proximity between clients and their local 
DNS servers and discovered that significant improvement
in proximity is possible by configuring clients to
use a closer local DNS server. Finally, we evaluated the
impact of the proximity between clients and their local 
DNS servers on server selection in three of the largest 
commercially deployed CDNs. We conclude that DNS 
is good for very coarse-grained server selection, but less 
suitable for fine-grained request distribution. 


The rest of the paper is organized as follows. Section 2 
describes our methodology and measurement setup for 
gathering DNS client associations. In Section 3, the as- 
sociation results are analyzed in detail to evaluate the 
proximity between the client and its local DNS server. 
Then, in Section 4 we study the impact of proximity 
evaluation on DNS-based server selection in three of the 
largest commercially deployed CDNs. Related work is 
covered in Section 5. In Section 6, we discuss future
work. Section 7 concludes. 


2 Experimental methodology 


In this section we describe our novel technique for de- 
termining a Web client’s local DNS server. This is a 
necessary first step in measuring the closeness of clients 
to their local DNS servers. We also evaluate the impact
of our technique on end user performance. Later, in Section 5,
we will explain how our technique is a significant
improvement over related previous work in terms of efficiency,
nonintrusiveness, and accuracy.


2.1 Measurement setup 


There are three main components necessary to use our
technique: a specialized authoritative DNS server, an
HTTP redirector, and a one-pixel embedded transparent
GIF image. To obtain a client population we solicited
volunteer Web sites. All the volunteers had to do to participate
in our study was to add a link to our one-pixel
transparent GIF to the end of one or more of their commonly
accessed Web pages. Assuming the experiment is
hosted by us at example.com, this involves adding the
following HTML code towards the end of a web page:


<img src="http://xxx.rd.example.com/tr.gif"
height=1 width=1>


To allow us to easily account for hits from different sites, 
each participant replaces xxx in the URL with a site
identifier¹. This allows us to easily add additional volunteer
sites without having to make any changes to our
Web or DNS server configuration. 


When a Web client loads the one-pixel embedded image,
our technique allows us to match the address of
the local DNS server resolving host names on behalf
of the client with the address of the client itself. This
process is shown in Figure 1. First, the client attempts
to get the image from xxx.rd.example.com,
our HTTP redirector. Rather than serving the image,
the redirector determines the client’s IP address and issues
an HTTP redirect to ipCLI.cs.example.com,
where CLI is replaced with a string encoding the IP
address of the client (step 2). Next, the client contacts
its local DNS server to resolve this domain name (step
3). The client’s local DNS server attempts to resolve
ipCLI.cs.example.com by sending a DNS request
to our authoritative DNS server (step 4). At this point
our authoritative DNS server logs the IP address of the
local DNS server and the client IP address embedded
within the query. It then sends the address of the content
server hosting the image back to the client’s local
DNS server (step 5). This resolution is passed on to the
client (step 6), which retrieves the image from the content
server (steps 7 and 8).
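
The pairing of client and local DNS server in steps 2 and 4 can be sketched as follows. The text says only that CLI is "a string encoding the IP address of the client", so the dash-separated encoding used here is an assumption, as are the function names encode_client, decode_client and record_association. The sketch handles IPv4 addresses only.

```python
import ipaddress

ZONE = "cs.example.com"  # zone served by the specialized authoritative DNS server


def encode_client(client_ip):
    """Build the host name the redirector sends the client to (encoding is assumed)."""
    octets = str(ipaddress.IPv4Address(client_ip)).split(".")
    return "ip" + "-".join(octets) + "." + ZONE


def decode_client(query_name):
    """Recover the client address from a query name, or None if it does not match."""
    name = query_name.rstrip(".").lower()
    suffix = "." + ZONE
    if not (name.startswith("ip") and name.endswith(suffix)):
        return None
    label = name[2:-len(suffix)]
    try:
        return str(ipaddress.IPv4Address(label.replace("-", ".")))
    except ValueError:
        return None


def record_association(log, query_name, ldns_ip):
    """Step 4: log the (client, LDNS) pair seen by the authoritative DNS server."""
    client_ip = decode_client(query_name)
    if client_ip is not None:
        log.add((client_ip, ldns_ip))
```

Keeping the log as a set of pairs means repeated retrievals by the same client through the same local DNS server count as a single unique association.
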


¹ Our authoritative DNS server [6] allows host names to be wildcarded,
so we can set an address for *.rd.example.com.


Figure 1: Embedded image request sequence


This measurement methodology has a limitation for 
clients that do not fetch inlined images and those that 
abort the page download process before the DNS resolu- 
tion is made for the embedded image. In these cases, we 
are unable to collect their local DNS server information. 


Note that in some cases, a local DNS server hierar- 
chy may exist. The local DNS server recorded in our 
measurement is the outermost local DNS server which 
directly contacts the authoritative DNS server for the 
example.com domain. In DNS-based server selec- 
tion, the CDN’s DNS server only sees the outermost lo- 
cal DNS server. In this study, this outermost DNS server 
is what we refer to as the “local DNS server.” 


This measurement approach is fully deterministic. It col- 
lects one association each time a new client visits a site 
with the embedded image. Multiple pages on the same 
site, or subsequent visits to the same page, may result 
in repeated retrievals of the calibrating image depending 
on the client’s caching policy. 


Note that the redirector also logs client requests — this 
information can be correlated with the DNS and web 
server logs to obtain the hidden load factors. Statistics 
on client browsing characteristics can also be gathered 
from the HTTP headers in the redirector log.


Table 1: Keynote image overhead measurements

Location   | Avg download latency (sec)   | Increased
           | without image | with image   | overhead
World wide | 1.17          | 1.31         | 12%
US         | 1.04          |              | 10%


2.2 Measurement impact 


Because we propose to use our measurement infrastruc- 
ture on a production Web site, it is important to evaluate 
its impact on the server performance and other aspects of 
its operation. The additional overhead our measurement 
technique imposes on Web client performance is the re- 
trieval of the transparent image, including the HTTP 
redirect and extra DNS requests. Because the image is 
transparent, it does not visually affect the page. Furthermore,
the image is small (43 bytes), which
keeps the added delay to a minimum. We also encourage
participants to include the image at the end of the HTML 
page containing it; therefore, browsers will normally re- 
quest it last. Thus, the extra latency associated with the 
image is usually hidden from the user’s Web browsing 
experience. Another advantage of the small size of the 
image is that when the image is not available for down- 
load, it does not affect the visual appearance of the Web 
page at all. 


Our custom HTTP redirector is a single-threaded, non-blocking,
300-line C program. The redirector responds
to all Web requests with a “302 Moved Temporarily”
HTTP redirect to a URL with the client’s IP address embedded
in it. Due to the small size and overhead of the
redirector, we found it to be highly reliable and more
responsive than a standard Web server.
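
The response sent by such a redirector is small. The sketch below builds one in Python for illustration; the actual redirector is a C program whose source is not shown here, so the function name redirect_response and the exact set of headers are assumptions.

```python
def redirect_response(location):
    """Return the bytes of a minimal HTTP/1.0 302 response pointing at `location`."""
    lines = [
        "HTTP/1.0 302 Moved Temporarily",
        "Location: " + location,
        "Content-Length: 0",
        "Connection: close",
        "",  # blank line that ends the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


# Hypothetical target URL with the client address embedded in the host name.
print(redirect_response("http://ip192-0-2-7.cs.example.com/tr.gif").decode("ascii"))
```

Because the response carries no body and needs no file access, a single-threaded server can answer each request with one short write.
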


To validate the claim of a small increase in latency, we 
measured a simple Web page with Keynote [2] to com- 
pare the download time with and without the embed- 
ded calibrating image. Keynote probes are located in 
25 cities within the US and 10 cities outside the US. The 
Web page we measured had a total size of 39 Kbytes in- 
cluding 13 images and was accelerated by a CDN. The 
increased overhead percentage is therefore higher than 
we would expect for a regular unaccelerated Web page 
with more embedded images. Table 1 shows that the increased
overhead averages less than 140 ms, which is
10-12% of the total download time. 
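
The world-wide figures reported in Table 1 (1.17 s without the image, 1.31 s with it) can be checked with a short calculation. The helper name added_latency is hypothetical.

```python
def added_latency(without_image_s, with_image_s):
    """Return (extra milliseconds, extra percent) caused by the embedded image."""
    extra_s = with_image_s - without_image_s
    return round(extra_s * 1000), round(100 * extra_s / without_image_s)


print(added_latency(1.17, 1.31))  # (140, 12)
```
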


We also tested our system to see what would happen
in the event of a failure of the redirector, image content
server, or DNS server. We found that the impact
of failure on the user is minimal. We tested the failure
of these three components using Microsoft Internet Explorer
(MSIE) 6 and Netscape Navigator 6 and found
that those browsers will first load the rest of the Web
page and then time out while trying to fetch the image.²
There is no visible change to the Web page or
any pop-up error message; however, the Netscape logo
or MSIE browser logo will provide visual feedback until
the browser times out.


Table 2: Participating sites in the study

Type                                | # of 1-pixel | Duration
att.com                             | 20,816,927   |
Personal pages (commercial domain)  | 1,743        |
Personal pages (university domain)  | 26,563       | 3 months


Table 3

Type                                                                    |
Client-LDNS associations                                                | 4,253,157
HTTP requests                                                           | 25,425,123
Unique client IPs                                                       | 3,234,449
Unique LDNS IPs                                                         | 157,633
Client-LDNS associations where client and LDNS have the same IP address |





3 Analysis results 


We conducted our measurement study for about three 
months, and nineteen Web sites participated, as de- 
scribed in Table 2. We classify these sites into two cate- 
gories: commercial (sites 1-3) and educational (sites 4- 
19). As we show in Section 3.1, the client and local DNS
associations for the two categories of sites have very
different characteristics. For ease of discussion, we use LDNS
to represent a local DNS server. A total of 4,253,157 
unique client and LDNS associations were collected. Ta- 
ble 3 presents the statistics of the DNS server and the 
redirector log for all sites. 


To study the proximity between the client and its local
DNS server, we use the following four metrics.


² We tested with the default setting without any special options.
Some older versions of both browsers were also tested, giving the same
behavior.


• AS clustering. Autonomous System (AS) clustering
refers to observing whether a client is in the
same AS as its local DNS server. An AS is a region
under a single administrative control. A single
AS might contain an entire backbone or a large
corporation which might span multiple continents.
Therefore, AS-based clustering is the most coarse-grained
metric we use.


• Network clustering. This metric observes whether
a client is in the same network-aware cluster (NAC)
as its local DNS server, where network clusters
are identified by the network-aware clustering technique
[13] using prefix entries from BGP routing
table snapshots from a wide set of routing tables.
Longest prefix matching is used to map clients to
network clusters identified by a network prefix. All
the clients within a network cluster are topologically
close together and with a high probability belong
to the same administrative domain. Validation
tests (in [13]) using nslookup and traceroute show
that the accuracy of network clustering is above
90% across all the Web logs from the study by
Krishnamurthy and Wang. Network clustering is
much more fine-grained than AS clustering [12].


For both AS and network clustering, BGP prefixes 
and the association of IP CIDR blocks to ASes were 
extracted from an extensive set of BGP tables col- 
lected on May 27, 2001 from the sources listed 
by Krishnamurthy and Wang [13] and Telstra In- 
ternet [5]. There are a total of more than 440,000 
unique routing entries. 


• Traceroute divergence. This metric, used previously
in [15], is based on the length of divergent
paths to the client and its local DNS server from a 
probe point using traceroute. It is defined to be the 
maximum number of disjoint network hops from a 
probe location to the client and its LDNS. 


• Round-trip time correlation. This metric, used
previously in both [15] and [4], refers to examin- 
ing the correlation between the message round-trip 
times from a probe point to the client and its local 
DNS server. 
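
The network clustering test relies on longest prefix matching, which can be sketched as below. This is a linear scan for clarity rather than the network-aware clustering software of [13]; a production version would more likely use a trie. The prefix list uses documentation address ranges and is hypothetical, as are the function names.

```python
import ipaddress


def longest_prefix(ip, prefixes):
    """Return the most specific prefix containing `ip`, or None if none matches."""
    addr = ipaddress.ip_address(ip)
    matches = [p for p in prefixes if addr in p]
    return max(matches, key=lambda p: p.prefixlen, default=None)


def same_cluster(client_ip, ldns_ip, prefixes):
    """True when client and LDNS map to the same network cluster (prefix)."""
    client_prefix = longest_prefix(client_ip, prefixes)
    return client_prefix is not None and client_prefix == longest_prefix(ldns_ip, prefixes)


# Hypothetical routing entries (not real BGP data).
PREFIXES = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("198.51.100.128/25"),
    ipaddress.ip_network("203.0.113.0/24"),
]

print(same_cluster("198.51.100.200", "198.51.100.130", PREFIXES))  # True
print(same_cluster("198.51.100.200", "198.51.100.5", PREFIXES))    # False
```

The second pair falls inside the same /24 but in different most-specific prefixes, so it counts as two clusters. AS clustering works the same way once each prefix is mapped to its originating AS.
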


AS clustering, network clustering, and traceroute divergence
are topology-oriented metrics, while round-trip
time correlation is a performance-oriented metric. AS
and network clustering are passive, requiring no active
probing. The other metrics are highly dependent on the
probe locations. To obtain an exhaustive evaluation of
proximity, we include all four metrics in our study.


Table 4: Aggregate statistics of AS/network clustering

Metrics            | # of client clusters | # of LDNS clusters | total # of clusters
AS clustering      |                      |                    | 9,570
Network clustering | 98,001               | 53,321             | 104,950


3.1 AS and network clustering 


Table 4 shows the aggregate statistics from the data we 
collected—the number of clusters containing clients, the 
number of clusters containing local DNS servers, and 
the total number of clusters. We note that from daily 
routing table analysis from several major ISPs [9], up to 
12,000 unique ASes were identified as being in use on 
November 12, 2001. The theoretical limit on the pos- 
sible number of ASes is determined by the 16-bit AS 
identifier, resulting in a total of 64K ASes. Thus, we 
observed close to 80% of ASes that were identified on 
November 12, 2001 and close to 15% of the total possible
ASes. With regard to network clusters, the maximum
number of network clusters is 440K, since we used 440K
unique prefixes. A one day extract from the 1998 Winter
Olympic Games server log has 9,853 client clusters [13].
Thus, our measurement data contains close to ten times
as many client clusters as one day of a popular Web
server log and close to 25% of all possible network clusters.
We conclude that the data we collected is extensive
and covers a significant number of ASes and network 
clusters. 
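
The coverage figures quoted above follow from the counts in Table 4. The short check below assumes that 9,570 is the total number of AS clusters observed and uses the upper figure of 12,000 ASes in use; the variable names are illustrative.

```python
observed_ases = 9_570
ases_in_use = 12_000             # upper figure quoted for November 12, 2001
possible_ases = 2 ** 16          # 16-bit AS identifier
observed_clusters = 104_950
possible_clusters = 440_000      # unique prefixes used for clustering
client_clusters = 98_001
olympic_client_clusters = 9_853  # one-day 1998 Winter Olympic Games log

print(f"{observed_ases / ases_in_use:.0%}")                # 80%
print(f"{observed_ases / possible_ases:.0%}")              # 15%
print(f"{observed_clusters / possible_clusters:.0%}")      # 24%
print(f"{client_clusters / olympic_client_clusters:.1f}")  # 9.9
```
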


Table 5 shows the percentage of client-LDNS associations
sharing the same cluster for clients visiting educational
sites, commercial sites, and all sites in our measurement
study. We observe that clients visiting educational
sites have better proximity to their local DNS
servers using the network and AS clustering metrics.
This is expected since most of these clients also come
from universities, which generally have a denser distribution
of local DNS servers and better local DNS configurations
than commercial ISPs. Because the majority
of our log results from hits to the commercial sites,
the proximity values for clients visiting all participating
sites are very close to those visiting commercial sites
alone. Because CDNs are most likely to accelerate commercial
sites, we believe our client mix is representative
of clients visiting a CDN-accelerated site. In the following
discussion, we consider clients visiting all participating
sites.


Table 5: Percentage of client-LDNS associations sharing the same cluster classified according to the types of domains
visited by the clients

Metrics         | Client IPs                            | HTTP requests
                | educational | commercial | combined   | educational | commercial | combined
AS cluster      | 70%         | 63%        | 64%        | 83%         | 68%        | 69%
Network cluster | 28%         | 16%        | 16%        | 44%         | 23%        | 24%


Using AS clustering, 64% of distinct client-LDNS asso- 
ciations share the same AS. Thus, more than half of the 
clients use a local DNS server in the same AS. This is 
expected, since it is common for an administrative do- 
main to run its own DNS server. If users configure their 
DNS settings correctly, they typically use the LDNS in 
their administrative domain by default. About 69% of 
the HTTP requests come from clients using an LDNS 
server in the same AS cluster. This means clients with 
LDNS in the same AS are slightly more active than those 
that use an LDNS in another AS. 
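
The difference between the two percentages (per association versus per HTTP request) can be made concrete with a small sketch. The function name same_cluster_shares and the sample records are hypothetical and are not taken from the measurement logs.

```python
def same_cluster_shares(associations):
    """associations: iterable of (same_cluster: bool, http_requests: int).

    Returns (share of associations, share of HTTP requests) in the same cluster.
    """
    records = list(associations)
    total_pairs = len(records)
    total_requests = sum(n for _, n in records)
    same_pairs = sum(1 for same, _ in records if same)
    same_requests = sum(n for same, n in records if same)
    return same_pairs / total_pairs, same_requests / total_requests


# Hypothetical log summary: three pairs share an AS, two do not.
sample = [(True, 10), (True, 6), (True, 4), (False, 3), (False, 2)]
print(same_cluster_shares(sample))  # (0.6, 0.8)
```

When the request-weighted share exceeds the per-association share, as it does here and in the measured data, clients whose LDNS is in the same cluster issue more requests on average.
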


The above results indicate that in about 64% of the cases, 
CDNs could select appropriate servers using DNS redi- 
rection with the granularity of ASes. Thus, even if a 
CDN deployed a cache in every AS in the world, it could
select the closest cache according to the AS metric only 
in 64% of the cases. However, AS clustering does not 
reveal how well redirection works for finer-grained load- 
balancing. An AS can span large geographical regions, 
causing network delays between two hosts within the 
same AS to be relatively high. For finer-grained load- 
balancing it is therefore important to consider network 
clustering, which groups together IP addresses that are 
close together topologically and likely to be under the 
same administrative domain. 


The observations using network clustering are significantly
different from the AS clustering results. Only
16% of the client-LDNS associations are in the same
network cluster. This shows that most clients are not in
the same routing entity as their local DNS servers. If the
HTTP request count is taken into account, about 24%
of the HTTP requests in our logs originated from clients
that used an LDNS in the same network cluster. Again,
the difference between these two numbers demonstrates
that clients with an LDNS in the same network cluster are
more active than those with an LDNS in a different network
cluster.


General Track: 2002 USENIX Annual Technical Conference 


Overall, these results indicate that DNS-based redirection
can confidently select appropriate CDN servers with
the granularity of an AS. However, for CDNs with mul- 
tiple servers in the same AS, the selection may not be 
as accurate. If there is a CDN server in each network 
cluster, then DNS-based redirection will only select the 
CDN server in the same network cluster as the client 
about 24% of the time. 


3.2 Traceroute divergence


Another metric to evaluate the proximity between the 
client and its local DNS server is the maximum num- 
ber of disjoint network hops from a probe location to 
the client and its local DNS server. In [15], this met- 
ric is referred to as the traceroute cluster size. The 
smaller the cluster size or traceroute divergence, the 
closer the client is to the local DNS server. In many of 
our traceroute results, we found that the network routes 
from the probe site to the client and its LDNS diverge 
and converge multiple times due to router load balanc- 
ing. We use the last point of divergence as the reference 
for calculating disjoint network hops. For example, Ta- 
ble 6 shows the network routes obtained by performing 
traceroute to the client 112.74.197.163³ and its LDNS 
112.25.195.1. We use hop 11 instead of 2 as the point 
of divergence. Thus, the traceroute divergence in this 
example is max(14 - 11, 13 - 11) = 3. 
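The calculation above can be sketched in Python as follows. This is an illustration rather than the tool used in the study: the function names, the representation of a route as a list of router addresses, and the rule that two routes share a hop only when the same address appears at the same hop number are all assumptions. Unanswered hops are written as `None` and never count as shared. The addresses are those of Table 6 as best they can be read.

```python
from typing import List, Optional

Route = List[Optional[str]]  # one router address per hop; None = no reply


def last_common_hop(route_a: Route, route_b: Route) -> int:
    """1-based index of the last hop at which both routes show the same
    address (0 if they never coincide)."""
    last = 0
    for hop, (a, b) in enumerate(zip(route_a, route_b), start=1):
        if a is not None and a == b:
            last = hop
    return last


def traceroute_divergence(route_a: Route, route_b: Route) -> int:
    """Longest disjoint tail, measured from the last point of divergence."""
    shared = last_common_hop(route_a, route_b)
    return max(len(route_a) - shared, len(route_b) - shared)


to_ldns = ["112.0.1.1", "112.124.182.17", "112.123.1.22", "112.122.5.246",
           "112.122.2.2", "112.122.2.206", "112.122.2.41", "112.122.2.26",
           "112.122.2.121", "112.123.145.25", "112.124.23.6", None,
           "112.25.195.1"]
to_client = ["112.0.1.1", "112.124.182.17", "112.123.1.10", "112.122.1.149",
             "112.122.2.173", "112.122.2.206", "112.122.2.41", "112.122.2.26",
             "112.122.2.121", "112.123.145.25", "112.124.23.6", "112.25.192.2",
             "112.25.192.181", "112.74.197.163"]

# The routes split after hop 2, rejoin at hop 6 and split for good after hop 11.
print(last_common_hop(to_client, to_ldns))        # 11
print(traceroute_divergence(to_client, to_ldns))  # max(14 - 11, 13 - 11) = 3
```

Scanning the whole route instead of stopping at the first mismatch is what makes hop 11, not hop 2, the reference point.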


We selected four probe sites representing candidate 
CDN servers and performed traceroute to a sample of 
clients and local DNS servers from the log. The sample 
consists of 48,908 client-LDNS pairs or 66,975 IP ad- 
dresses. It is obtained by randomly selecting one client- 
LDNS pair from the top half of the client network clus- 
ters generating the most HTTP requests. The number of 
client-LDNS pairs reached by an individual probe site 
ranges from 9,878 to 11,935. In about 20% of these, 
both the client and the LDNS belong to the same net- 
work cluster. And in about 75% of these, both the client 
and the LDNS belong to the same AS cluster. 


³ For privacy concerns, the IP addresses have been anonymized. 


Table 6: Traceroute divergence 

Route to the LDNS (112.25.195.1): 

Hop | Address | RTT 
1 | 112.0.1.1 | 5 ms 
2 | 112.124.182.17 | 15 ms 
3 | 112.123.1.22 | 14 ms 
4 | 112.122.5.246 | 7 ms 
5 | 112.122.2.2 | 24 ms 
6 | 112.122.2.206 | 31 ms 
7 | 112.122.2.41 | 35 ms 
8 | 112.122.2.26 | 68 ms 
9 | 112.122.2.121 | 77 ms 
10 | 112.123.145.25 | 72 ms 
11 | 112.124.23.6 | 73 ms 
12 | * * * | 
13 | * 112.25.195.1 | 71 ms 

Route to the client (112.74.197.163): 

Hop | Address | RTT 
1 | 112.0.1.1 | 6 ms 
2 | 112.124.182.17 | 6 ms 
3 | 112.123.1.10 | 7 ms 
4 | 112.122.1.149 | 8 ms 
5 | 112.122.2.173 | 25 ms 
6 | 112.122.2.206 | 32 ms 
7 | 112.122.2.41 | 34 ms 
8 | 112.122.2.26 | 71 ms 
9 | 112.122.2.121 | 75 ms 
10 | 112.123.145.25 | 73 ms 
11 | 112.124.23.6 | 72 ms 
12 | 112.25.192.2 | 72 ms 
13 | 112.25.192.181 | 73 ms 
14 | 112.74.197.163 | 92 ms 


Figure 2: Proximity evaluation using traceroute divergence 


Figure 2 shows the cumulative distribution of traceroute 
divergence for the sampled client-LDNS pairs. About 
14% of them have traceroute divergence of 1. The mean 
divergence varies from 5.8 to 6.2 depending on the probe 
site, and the median traceroute divergence is 4 from all 
four probe sites. This means that a large fraction of 
clients are topologically quite close to their local DNS 
servers using the hop count metric. At most 30% of the 
client-LDNS pairs have traceroute divergence of size 8. 
This result is slightly inconsistent with the results de- 
scribed by Shaikh, et al. [15] considering 1,090 client- 
LDNS pairs of dial-up ISPs. We believe that the dif- 
ference can be explained by the fact that our results are 
based on the analysis of a much larger set of populations 
visiting both commercial and educational sites. 


Figure 3: Ratio of common to disjoint path length 


The absolute values of traceroute divergence may not be 
completely indicative of the proximity of a client to its 
local DNS server. In Figure 3, we plot the ratio of the 
common path length to the disjoint path length from a 
probe site. Using the terminology of Shaikh, et al. [15], 
the common path length is the minimum number of net- 
work hops of the shared path from the probe site to the 
local DNS server and the client before their paths di- 
verge. For example, the common path length of client 
112.74.197.163 and its LDNS 112.25.195.1 (shown in 
Table 6) is min(11, 11) = 11. The disjoint path length 
is the maximum number of network hops of the diverg- 
ing paths. In this example, the disjoint path length is 
max(14 - 11, 13 - 11) = 3. Again, we use the last point of 
the divergence as the reference point. For all probe sites, 
less than 34% of the client-LDNS pairs have disjoint 
paths at least as long as the common path. This means 
that at least 66% of client-LDNS pairs have a common 
path as long as or longer than their disjoint path. This 
metric implies that most clients are topologically close 
to their LDNS as viewed from a randomly chosen probe 
site. 
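As a small illustration of this metric, the sketch below computes the ratio for one pair and the fraction of pairs whose common path is at least as long as their disjoint path. It assumes the two path lengths have already been extracted from the traceroute output; the function names and the sample of hop counts are hypothetical.

```python
from typing import Iterable, Tuple


def common_to_disjoint_ratio(common_len: int, disjoint_len: int) -> float:
    """Ratio of the shared path length to the longer diverging branch.
    Returns infinity when the two routes never diverge."""
    if disjoint_len == 0:
        return float("inf")
    return common_len / disjoint_len


def fraction_mostly_shared(pairs: Iterable[Tuple[int, int]]) -> float:
    """Fraction of (common_len, disjoint_len) pairs whose common path is at
    least as long as the disjoint path."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(1 for common, disjoint in pairs if common >= disjoint) / len(pairs)


# Table 6 example: common path min(11, 11) = 11, disjoint path 3.
print(round(common_to_disjoint_ratio(11, 3), 2))

# Hypothetical (common, disjoint) hop counts, for illustration only.
sample = [(11, 3), (4, 9), (7, 7), (2, 12)]
print(fraction_mostly_shared(sample))
```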


3.3 Round-trip time correlation 


Some CDNs select servers based on the round-trip la- 
tency between the CDN server and the client’s local 
DNS server [15]. It is therefore important to understand 
the correlation between the round-trip delay to a client 
and to its LDNS from a third location. 


To compare with the results presented in [15], we study 
how the round-trip delays to the client and its LDNS de- 
termine the accuracy of the CDN server selection based 
on round-trip delays to the LDNS. Since our data set 
consists of more than 4.2 million pairs of client and 
LDNS, much larger than that presented in [15] (1,090 
pairs), we expect some differences. Let $t_c^i$ and $t_l^i$ be the 
round-trip delays between probe site $i$ and the client, 
and between probe site $i$ and the client’s LDNS, re- 
spectively. We ask whether $t_l^i < t_l^j$ implies 
$t_c^i < t_c^j$. Depending on the locations of the two probe sites 
$i$ and $j$, the percentage of violations ranges from 17% to 
38%. For instance, among the 9,360 client-LDNS pairs 
responding to traceroute from both probe sites 1 and 2, 
about 38% violate this assumption. This implies that if 
one selects between two CDN servers located at probe 
sites 1 and 2 based on the round-trip delays to the LDNS, 
the decisions would be suboptimal 38% of the time for 
the set of clients considered based on the round-trip de- 
lay metric. On the other hand, among the 7,895 pairs re- 
sponding to traceroute from both probe sites 2 and 4, only 
17% violate this assumption. This means that this metric 
is highly dependent on probe locations. However, it is a 
reasonable metric to use for avoiding very distant servers. 
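The ordering test can be sketched as follows. The record layout and function names are hypothetical, and the sample delays are invented for illustration. A pair counts as a violation when the probe site with the smaller delay to the LDNS does not also have the strictly smaller delay to the client; pairs whose LDNS delays are equal express no preference and are not counted as violations, which is an assumption of this sketch.

```python
from typing import Iterable, NamedTuple


class PairRtt(NamedTuple):
    """Round-trip delays (ms) for one client-LDNS pair from probe sites i and j."""
    client_i: float
    ldns_i: float
    client_j: float
    ldns_j: float


def violates_ordering(p: PairRtt) -> bool:
    """True if the site nearer to the LDNS is not strictly nearer to the client."""
    if p.ldns_i < p.ldns_j:
        return not (p.client_i < p.client_j)
    if p.ldns_j < p.ldns_i:
        return not (p.client_j < p.client_i)
    return False  # equal LDNS delays: no preference to violate


def violation_rate(pairs: Iterable[PairRtt]) -> float:
    """Fraction of pairs for which LDNS-based ordering misleads the selection."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(violates_ordering(p) for p in pairs) / len(pairs)


# Invented delays, for illustration only.
sample = [
    PairRtt(client_i=20, ldns_i=15, client_j=40, ldns_j=35),  # consistent
    PairRtt(client_i=60, ldns_i=10, client_j=30, ldns_j=25),  # violation
    PairRtt(client_i=12, ldns_i=50, client_j=45, ldns_j=48),  # violation
    PairRtt(client_i=80, ldns_i=70, client_j=20, ldns_j=15),  # consistent
]
print(violation_rate(sample))
```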


Another interesting question to answer is whether, if two 
CDN servers are roughly an equal distance from the 
LDNS based on the round-trip delay, the same holds 
from the client’s perspective. Thus, we ask whether 
$|t_l^i - t_l^j| < w$ implies $|t_c^i - t_c^j| < w$, where $w$ is a small 
number (e.g., a 10 ms threshold was used by Shaikh et 
al. [15]). In the sample of our study, it holds in 44-75% 
of the cases depending on the probe sites. This num- 
ber is bigger than the previously obtained result of 12% 
in [15]. 
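A sketch of this second test is given below. It assumes that the percentage is taken over the pairs that satisfy the premise, i.e., those whose two LDNS delays differ by less than w; that choice of denominator, the tuple layout and the sample values are assumptions made for illustration.

```python
from typing import Iterable, Optional, Tuple

# (ldns_i, ldns_j, client_i, client_j), all round-trip delays in ms
Rtts = Tuple[float, float, float, float]


def equidistance_preserved(rtts: Rtts, w: float = 10.0) -> Optional[bool]:
    """None if the two sites are not roughly equidistant from the LDNS;
    otherwise whether they are also roughly equidistant from the client."""
    ldns_i, ldns_j, client_i, client_j = rtts
    if abs(ldns_i - ldns_j) >= w:
        return None
    return abs(client_i - client_j) < w


def preserved_fraction(samples: Iterable[Rtts], w: float = 10.0) -> float:
    """Share of premise-satisfying pairs for which the conclusion also holds."""
    outcomes = [equidistance_preserved(s, w) for s in samples]
    relevant = [o for o in outcomes if o is not None]
    return sum(relevant) / len(relevant) if relevant else 0.0


# Invented delays, for illustration only.
samples = [
    (30, 35, 40, 44),  # premise holds, conclusion holds
    (30, 38, 20, 55),  # premise holds, conclusion fails
    (30, 90, 20, 25),  # premise fails, ignored
    (50, 52, 70, 61),  # premise holds, conclusion holds
]
print(round(preserved_fraction(samples), 2))
```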


3.4 Improved local DNS configuration 


For the client and local DNS associations that are not 
in the same network cluster, we ask whether there exist 
any local DNS servers in those clusters. From our log, 
we collected a set of local DNS servers. Thus, assum- 
ing the clients have access to those local DNS servers 
in their network clusters, it is interesting to examine the 
degree of improvement if all LDNS servers were used 
optimally. This assumption is not unreasonable, since 
most IP addresses in the same network cluster are under 
the same administrative control. From Table 4, we can 
calculate the number of client ASes and network clusters 
where there are no local DNS servers as observed in our 
log. There are 9,570 - 8,590 = 980 such AS clusters, 
and 104,950 - 53,321 = 51,629 such network clusters. 
Table 7 compares the improved percentages of client- 
LDNS associations and HTTP requests in the same clus- 
ter with the original results. If the clients in our data cur- 
rently configured to use a LDNS in a different cluster are 
allowed to use an LDNS in the same cluster, then at least 
92% of the HTTP requests come from clients using the 
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Table 7: Improvement of the percentage of the client- 
LDNS associations sharing the same cluster using opti- 
mal LDNS assignment 


Metrics Chient IPs 


HTTP requests | 





LDNS in the same AS cluster. That number is 70% for 
network clusters. 
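The optimal-assignment estimate can be sketched as follows. The sketch assumes that a client could switch to any LDNS observed in its own cluster, so an association counts as local either when it already is, or when the client's cluster contains at least one LDNS seen in the log. The tuple layout, the function name and the sample values are hypothetical.

```python
from typing import List, Tuple

Assoc = Tuple[str, str, int]  # (client cluster, LDNS cluster, HTTP requests)


def request_share_same_cluster(assocs: List[Assoc], optimal: bool) -> float:
    """Request-weighted share of associations with client and LDNS in one
    cluster, either as observed or under optimal LDNS assignment."""
    clusters_with_ldns = {ldns for _, ldns, _ in assocs}
    total = sum(requests for _, _, requests in assocs)
    if total == 0:
        return 0.0
    local = 0
    for client, ldns, requests in assocs:
        if client == ldns or (optimal and client in clusters_with_ldns):
            local += requests
    return local / total


# Invented sample: cluster C contains no LDNS at all, so it cannot improve.
sample = [("A", "A", 40), ("B", "A", 30), ("C", "B", 20), ("B", "B", 10)]
print(request_share_same_cluster(sample, optimal=False))
print(request_share_same_cluster(sample, optimal=True))
```

Clients in clusters without any observed LDNS bound the achievable improvement, which is why the optimal figure stays below 100%.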


3.5 Clients using multiple local DNS servers 


Some client IP addresses in our data are associated with 
multiple LDNS IP addresses. This may happen due to 
the following reasons: (1) The first LDNS server the 
client contacts times out and the second LDNS server 
is contacted. (2) The client’s LDNS server is config- 
ured by a DHCP server that assigns the LDNS server 
IP addresses from a set of addresses in a round-robin 
fashion. (3) A client may be configured to round-robin 
among multiple LDNS servers. (4) The client IP address 
is reused at different times by different users and these 
users may have different configurations for their LDNS 
servers, resulting in different associations. (5) The client 
IP address is that of a NAT box or an application-level 
proxy, so there are multiple actual clients behind this IP 
address using different LDNS servers. (6) The client is 
misconfigured. 


Here we examine the distribution of the LDNS servers 
with which a client IP address is associated. If they all 
occupy the same cluster as the client, DNS-based server 
selection can use the local DNS server’s IP address to 
estimate where the client is even if the client uses multi- 
ple local DNS servers. However, if they occupy multiple 
clusters or a single cluster different from the client, it is 
more difficult to use DNS-based server selection. In Ta- 
ble 8, we show how many clients use ten or fewer local 
DNS servers. In addition, we observe that some IP ad- 
dresses are associated with up to 330 local DNS servers 
occupying up to 273 different network clusters. Further 
investigation shows that some of these addresses belong 
to cache proxies. In general, we observe that the more 
LDNS servers with which a client IP address is associ- 
ated, the lower the percentage of associations with the 
client and LDNS in the same cluster. Fortunately, the 
majority of client IP addresses are associated with a sin- 
gle LDNS server. They are responsible for about 52% 
of the requests. However, only about 20% in this group 
have the client and LDNS in the same network cluster. 


USENIX Association 


Table 8: Clients using ten or fewer multiple local DNS 
servers 

Columns: # of clients (% of total) | # of LDNS (avg # of NACs) | % of total HTTP requests | % associations with client and LDNS in the same NAC 

Surviving values, in their original order (some rows and cells are illegible): 

2,524,939 (78.064) | 1 (1.0) | 20.3 
522,228 (16.146) | 2 (1.6) | 22.4 
41,422 (1.281) | 4 (2.5) | 4.9 
4,555 (0.141) | 6 (3.3) | 18 
713 (0.022) 
461 (0.014) | 9 (5.5) | 0.7 
273 (0.008) | 0.5 | 14.0 


3.6 Comparisons of proximity metrics 


Given the above set of metrics for evaluating proxim- 
ity between a client and its local DNS server, we compare 
their results on a common set of 7,894⁴ client-LDNS as- 
sociations in Table 9. The comparison shows that net- 
work clustering is a fine-grained metric, similar to a trace- 
route divergence (TD) of 1. Hosts within the same 
network cluster, or which have a TD of 1, are guaranteed 
to be very close to each other. However, hosts that are not in the 
same network cluster, or that have a TD bigger than 1, may 
still be quite close. Thus, these two metrics are quite 
conservative. AS clustering is the most coarse-grained 
metric, since an AS can be quite large. This is compara- 
ble to the ratio of common to disjoint path length. RTT 
correlation is also a relatively coarse-grained metric. It 
is inconclusive and largely dependent on the two probe 
site locations. 


In general, performance-oriented metrics such as round- 
trip time should provide accurate real-time network la- 
tency measurements. CDNs often do real-time network 
measurements from their servers to clients. Since we 
can only probe from a limited set of locations, such met- 
rics are inconclusive. Topology-oriented metrics have 
the advantage of being non-invasive, since they do not 
incur any network overhead. However, they cannot take 
network congestion into account. 


As we explain in the following section, the applicability 
of each metric depends on the density of CDN server 
placement. The denser the placement, the more fine- 
grained the metric that is needed. 


⁴ Only 7,894 of all associations can be reached from both probe 
sites 2 and 3. 


Table 9: Comparison of four proximity metrics 

Proximity metric | Evaluation 
AS clustering | 78% in the same cluster 
Network clustering | 
Traceroute divergence (TD) (probe site 2) | 16%: TD = 1, 32%: TD = 2; median TD = 4, mean TD = 5.7; 65%: disjointPathLen < commonPathLen 
RTT correlation (probe sites 2, 3) | 71%: $t_l^2 < t_l^3 \Rightarrow t_c^2 < t_c^3$; 62%: $|t_l^2 - t_l^3| < 10$ ms $\Rightarrow |t_c^2 - t_c^3| < 10$ ms; $a = t_c^2 - t_c^3$, $b = t_l^2 - t_l^3$, correl(a, b) = 0.13 


4 Application impact 


In this section, we focus on the impact that client-LDNS 
associations have on DNS-based server selection. We 
study this impact in detail for three of the largest com- 
mercial CDNs. We anonymize the CDN names to prop- 
erly reflect the nature of this work as a research vehicle 
rather than any form of competitive analysis. All three 
CDNs chosen rely on deploying caches in multiple net- 
works. ISP-based CDNs deployed by companies like 
AT&T and Qwest are excluded from this study, since 
their caches are located in one or two ASes. Since a 
client and its LDNS are very likely to be in the same 
AS (about 69% of HTTP requests in our study), an ISP- 
based CDN can easily identify a peering link that is suit- 
able for the AS containing both of them⁵. The results 
described below are representative of all the data we col- 
lected and remained stable during our entire study. 


Previous work by Johnson, et al. [10] has shown that 
DNS-based CDNs do not always pick the best server 
available. Here we study whether this is partly due to 
the inherent limitations of DNS-based server selection. 
The answer to this largely depends on the proximity be- 
tween clients and local DNS servers and the location of 
CDN servers. 


⁵ The main tradeoff here is fewer peering links traversed in multi- 
ISP CDNs versus less traffic between access and backbone routers as 
well as lower costs in single-ISP CDNs. 


The proximity evaluation of client-LDNS associations 
using the network clustering metric indicates that, if a 
CDN had a server in each network cluster, about 84% 
of the selection decisions for the client population in 
our log could be suboptimal. This is because our study 
found only 16% of these clients have their LDNS in the 
same network cluster. For clients with their LDNS in 
different network clusters, the CDN would most likely 
resolve the DNS query from a client’s LDNS to the CDN 
server in the LDNS’s cluster and not the cluster where 
the client resides. In reality, and as we show below, even 
the biggest CDN today does not have a CDN server in 
every network cluster. Thus, it is important to examine 
the impact of DNS-based redirection in a commercial 
content distribution setting. 


We assume that on average a CDN server within the 
client’s AS/network cluster or smaller traceroute diver- 
gence (TD) is closer than one in a different cluster or 
larger TD. For clients with CDN servers in their clusters, 
if a CDN selects a server not in a client’s cluster, this 
may be a suboptimal decision in terms of proximity. We 
also assume that CDNs attempt to optimize for proxim- 
ity in most cases. Network bandwidth is less important, 
since the content delivered by these CDNs is relatively 
small in size. Although CDNs may also incorporate the 
avoidance of overloaded servers in their server selection 
algorithms, we believe that our assumption is reasonable 
because CDNs today are highly overprovisioned from 
the perspective of server capacity. Furthermore, we re- 
peated our experiments on separate dates to avoid any 
possibility of a skew due to a flash event, and the results 
were always similar. One limitation in our results below 
is that we do not quantify suboptimal server selection in 
terms of end user performance, nor how close it is to the 
optimal server selection. 


We first describe our measurement methodology, then 
use AS/network clustering and traceroute divergence to 
study how the proximity between client and LDNS af- 
fects DNS-based server selection in three commercial 
CDNs. 


4.1 Experiment methodology 


We use the following three data sets for our study. 


1. Client-LDNS associations. These associations be- 
tween clients and their LDNS servers are obtained 
from our measurement study. 


2. LDNS-CDN server associations. For a given 
CDN, these associations map LDNS servers from 
the first data set to the CDN servers selected by 
the CDN when resolving a query from these LDNS 
servers. 


3. Available CDN servers. This data set represents a 
list of CDN servers available in a given CDN. 


In the first data set, we sampled 42,991 LDNS servers 
from our measurement study. We obtained the second 
data set by sending DNS queries to these 42,991 LDNS 
servers using the dig command for a domain name of a 
Web site that we know is a customer of a given CDN. 
27,918 of these LDNS servers do not use access con- 
trol and hence answered the queries from our machines, 
as if these machines were their clients. To answer our 
queries, these LDNSs recursively resolved our queries 
with the CDN in question. The server selected by the 
CDN for this DNS query is exactly the same server that 
would be used by any real client associated with this 
LDNS, as if that client and not our machine initiated the 
DNS query.⁶ 
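The query step can be sketched in Python as follows. This is an illustration, not the script used in the study: the helper names and timeout values are hypothetical. The sketch relies only on standard `dig` usage (`@server`, `+short`, `+time`, `+tries`) and takes the first IPv4 address in the answer, skipping any CNAME targets that `+short` prints before the addresses.

```python
import ipaddress
import subprocess
from typing import List, Optional


def dig_command(ldns: str, domain: str) -> List[str]:
    """Build a dig invocation asking the given LDNS to resolve the domain."""
    return ["dig", f"@{ldns}", domain, "A", "+short", "+time=2", "+tries=1"]


def first_address(dig_short_output: str) -> Optional[str]:
    """Return the first IPv4 address in `dig +short` output, or None."""
    for line in dig_short_output.splitlines():
        token = line.strip()
        try:
            ipaddress.IPv4Address(token)
        except ValueError:
            continue  # CNAME targets, comments and empty lines are skipped
        return token
    return None


def selected_server(ldns: str, domain: str) -> Optional[str]:
    """CDN server address returned via this LDNS, or None if it did not answer
    (for example because it applies access control)."""
    try:
        result = subprocess.run(dig_command(ldns, domain), capture_output=True,
                                text=True, timeout=10)
    except subprocess.TimeoutExpired:
        return None
    return first_address(result.stdout)
```

A caller would invoke `selected_server` once per LDNS for a domain served by the CDN under study and record the pair (LDNS, returned address) as one LDNS-CDN server association.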


The third data set was obtained in a similar way, except 
we added a large number of additional LDNS servers to 
the 27,918 LDNS servers above, for a total of 41,754 
different local DNS servers. This is to increase the like- 
lihood of finding all CDN servers of a particular CDN 
for a given domain. The extensive list of geographically 
distributed LDNS servers was obtained from DNS server 
logs for a large Web site. The set of servers to which a 
given CDN resolved queries from these LDNSs repre- 
sents the servers available in this CDN at the time of 
the experiment. We obtained our second and third data 
sets at around the same time each day to find the set of 
servers available to a CDN at the time it performed its 
server selection in the second experiment. 


Note that our set of available servers is conservative, 
since we might not have discovered all available CDN 
servers. However, if a CDN performs a suboptimal 
server selection among a subset of all available servers, 
its server selection will remain suboptimal for a larger 
set: suboptimal means that we already found a closer 
server to the client than the one selected by the CDN. A 
superset of the list of servers would suffer from the same 
suboptimal assignment. 


Many CDNs claim a much larger number of caches. 
However, CDNs do not utilize all servers for all Web 
sites and many of their locations may contain multiple 
caches. The statistics we gathered are for a particular 
domain served by a CDN. For example, when examining 
multiple different domain names served by the largest 
CDN in our study, we found multiple CDN IP address 
sets of approximately equal size which only partly over- 
lapped. Each unique server IP address we discover may 
also account for multiple servers. 


⁶ Note that, for fault tolerance, most CDN DNS servers usually return 
multiple IP addresses. In this case, we pick the first one, since clients 
also typically choose the first IP address. 


Table 10: CDN cache servers for a particular domain 
name 

CDN | # of CDN server IPs | # of AS clusters with servers | # of network clusters with servers 





Table 11: The evaluation of server selection according 
to AS clustering 

Row | CDN X | CDN Y | CDN Z 
Clients w/ CDN server in cluster | 1,679,515 | 1,215,372 | 618,897 
Verifiable clients | - | - | - 
Misdirected clients | 809,683 | 752,822 | 434,905 
(% verifiable clients) | (60%), column unclear 
(% clusters occupied) | (92%), column unclear 
MC w/ LDNS not in client’s cluster | 443,394 | 354,928 | 262,713 
(% misdirected clients) | (55%) | (47%) | (60%) 


Table 10 shows the statistics of the CDN server IP ad- 
dresses of the three CDNs studied for a single domain 
name obtained on August 7, 2001. These numbers were 
fairly stable during the course of our study. All three 
CDNs examined appear to redirect client requests by us- 
ing DNS, although they may differ in the details of the 
algorithms. This table lists the total number of CDN 
servers discovered and the number of AS and network 
clusters these CDN servers represent. The data in Table 
10 confirm our conjecture that CDNs today cover only a 
small number of all available network clusters for a sin- 
gle domain they serve. While the overall list of LDNSs 
used for generating the third data set represents 5,788 
AS and 21,786 network clusters, the discovered CDN 
servers represent only a small fraction of these, even in 
the case of the largest CDN in our study. 


With the three data sets above, we evaluate the quality 
of server selection by these CDNs by examining what 
percentage of clients are actually redirected to servers in 
their own cluster, among those clients that have at least 
one server in their cluster. 





Table 12: The evaluation of server selection according 
to network clustering 

Row | CDN X | CDN Y | CDN Z 
Clients w/ CDN server in cluster | 264,743 | 156,507 | 103,448 
Verifiable clients | - | - | - 
Misdirected clients | 154,198 | 125,449 | 87,486 
(% verifiable clients) | (68%) | (94%) | (96%) 
(% clusters occupied) | (77%) | (82%) | (93%) 
MC w/ LDNS not in client’s cluster | 145,276 | 116,073 | 84,737 
(% misdirected clients) | (94%) | (93%) | (97%) 


4.2 Results of DNS-based server selection in 
commercial CDNs 





Tables 11 and 12 show the results of our server selection 
evaluation using AS and network clustering. We col- 
lected 3,234,449 distinct client IP addresses in our logs. 
The first row of the tables contains the number of clients 
with CDN servers in their clusters for the considered 
CDNs. Depending on the server density of each CDN, 
the number of clients with servers in their AS clusters 
ranges from 19% to 52% of the total clients in the log. 
This fraction is an order of magnitude lower in the con- 
text of network clusters. Thus, according to either met- 
ric, most clients will have to be served by remote servers. 
But a more interesting question is how many clients that 
could have been served by local servers are in reality di- 
rected to remote ones. 


To answer this question, we concentrate on clients with 
servers in their clusters and consider the LDNS-CDN 
server associations for these clients from the second data 
set. Unfortunately, not all of these LDNS servers re- 
spond to DNS queries from our machines. The second 
row of the tables gives the number of clients, among 
those with CDN servers in their clusters, whose LDNS 
servers responded to our queries. We call these clients 
verifiable because we could find out which CDN servers 
a CDN would redirect these clients to. The third row 
shows the number of clients that a CDN directed to an 
external CDN server (one that was outside the client’s 
cluster), when there was an available CDN server within 
that cluster. We refer to such clients as misdirected 
clients (MC) based on the assumption that CDN servers 
within the cluster are closer than external ones, although 
we accept that other factors than proximity may have 
affected the assignment. We see a large number of mis- 
directed clients according to both proximity metrics. To 
confirm that these misdirected clients are not due to any 
anomaly of clients belonging to a small number of clus- 
ters, we also show in the third row the percentage of 
clusters occupied by these clients relative to the total 
number of clusters of verified clients. The cluster per- 
centage values are at least as big as the client percentage 
values. This means that the misdirected clients are fairly 
spread out in the number of clusters they occupy. 
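The bookkeeping behind the first three rows of Tables 11 and 12 can be sketched as follows. The record layout, the counter keys and the sample values are hypothetical; the cluster can be either an AS or a network cluster, depending on which table is being produced.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Optional, Set


class ClientRecord(NamedTuple):
    cluster: str                     # cluster containing the client
    selected_cluster: Optional[str]  # cluster of the CDN server returned via the
                                     # client's LDNS; None if the LDNS did not answer


def tally(clients: Iterable[ClientRecord], server_clusters: Set[str]) -> Counter:
    """Count clients with a CDN server in their cluster, the verifiable ones
    among them, and the misdirected ones among the verifiable."""
    counts: Counter = Counter()
    for c in clients:
        if c.cluster not in server_clusters:
            continue  # no local server, so this client cannot be misdirected
        counts["with_server_in_cluster"] += 1
        if c.selected_cluster is None:
            continue  # LDNS did not respond: not verifiable
        counts["verifiable"] += 1
        if c.selected_cluster != c.cluster:
            counts["misdirected"] += 1
    return counts


# Invented sample, for illustration only.
servers = {"A", "B"}
clients = [
    ClientRecord("A", "A"),   # served locally
    ClientRecord("A", "C"),   # misdirected
    ClientRecord("B", None),  # unverifiable
    ClientRecord("C", "A"),   # no server in its cluster
    ClientRecord("B", "A"),   # misdirected
]
print(dict(tally(clients, servers)))
```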


We conjecture that the reason that these clients are mis- 
directed is that their LDNS servers are topologically dis- 
tant from these clients. CDNs select a server close to 
the LDNS servers. The servers selected may therefore 
be suboptimal from the client’s perspective. The last 
row of the tables shows misdirected clients with their 
LDNS outside their clusters. This row indicates the 
number of clients that inherently cannot be directed to 
the most proximal server using a DNS-based mecha- 
nism. According to Table 11, for AS clustering, they 
represent only half of misdirected clients. To understand 
why CDNs choose a CDN server in a different AS than 
the one containing the client and its LDNS server, we 
sampled a dozen of these clients using traceroute fol- 
lowed by DNS name resolution of the last-hop router 
IP address to estimate the geographic locations⁷ of the 
client, CDN servers in the client’s AS, and selected CDN 
servers in a different AS. We found that in most cases, 
the CDN servers selected by the CDNs are geographically 
closer to the client than CDN servers in the same AS. 
Assuming peering links between the client’s AS and the 
selected CDN server’s AS are not congested, redirect- 
ing to a nearby CDN server in a different AS may be a 
better decision than redirecting to a distant CDN server 
in the same AS. This observation also confirms our find- 
ing that AS clustering is a very coarse-grained metric for 
evaluating proximity. 


For network clustering, the last row of Table 12 indicates 
that an overwhelming majority of misdirected clients 
have their LDNS servers in a different network cluster. 
This confirms our hypothesis that such misdirection is 
due to the fact that clients and their LDNS servers are of- 
ten not proximal. It also shows the usefulness of network 
clustering because it is a fine-grained metric for eval- 
uating proximity. We emphasize that we do not know 
the exact server selection policy used by a commercial 
CDN, so we cannot fully evaluate the effectiveness of 
its server selection decisions. However, given that there 
is such a strong correlation between misdirection and an 
LDNS being in a different cluster, we can infer that when 
the LDNS and client do not belong to the same network 
cluster, this limits the accuracy of server selection. 


⁷ In many cases, the router’s DNS name has an indication of the 
geographic location [14]. 


Table 13: The evaluation of server selection according 
to traceroute divergence (TD) from probe site 3 

Row | CDN X | CDN Z 
Client-LDNS pairs examined | 2,105 | 2,171 
Clients with CDN servers at smaller TD | 1,606 | 1,724 
Median TD of CDN servers clients redirected to | 13, column unclear 
Median TD of closest CDN servers to clients | - | - 
Median TD improvement | - | 4 







Table 13 shows the evaluation of DNS-based server se- 
lections according to the traceroute divergence metric.⁸ 
We performed traceroute from probe site 3 to a sample of 
clients and local DNS servers from the log and the CDN 
cache servers from the third data set. The sample is cho- 
sen by randomly selecting one client-LDNS pair from 
the top 1200 client clusters generating the most HTTP 
requests. We found over 70% of the clients to be di- 
rected to a CDN server that is more distant than another 
available CDN server. Selecting the closest CDN server 
would have reduced traceroute divergence by as much as 
19 hops for some clients. 


Overall, we conclude that, among the clients we could 
verify, knowing the client’s IP address would allow more 
accurate server selections in a large number of cases 
(443,394 for CDN X). The last row of Tables 11 and 12 
also indicates the number of improved CDN server selec- 
tions if the client’s IP address were known to the CDN. 
Relative to the total number of clients, in the case of 
CDN X, this represents a small percentage: specifically 
14% (443,394 out of 3,234,449). In general, the num- 
ber of misdirected clients depends on the server density, 
placement, and selection algorithms. 


⁸ We were unable to include CDN Y in the traceroute experiment, 
since most of its CDN servers are unreachable using traceroute. 


5 Related work 


Our work is motivated by a related effort by Shaikh, 
et al. [15] examining the effectiveness of DNS-based 
server selection. They developed a method of find- 
ing client-LDNS associations using time correlations of 
DNS and HTTP requests from DNS and Web server 
logs. However, as they have noted, the associations ob- 
tained using their method are inherently inaccurate due 
to clock skews, client DNS caching, and mishandling of 
TTLs. To resolve ambiguities, they used heuristics based 
on AS numbers and domain names to decide whether 
a client and a nameserver did in fact belong together. 
This heuristic removed misconfigured client-nameserver 
pairs and did not assure the correctness of associations. 
They also obtained a set of 1090 client-LDNS associa- 
tions from accounts with 9 commercial ISPs to study the 
proximity correlations. 


In comparison, our method provides accurate associa- 
tions eliminating any need for validation. Furthermore, 
our study has more than 4.2 million associations, con- 
sisting of clients from a diverse set of ISPs, far exceed- 
ing their data set of 1090 associations. 


More recently, Bestavros, et al. [4] have also developed a 
method for finding client-LDNS associations by assign- 
ing multiple IP addresses to a Web server and correlat- 
ing DNS lookups with client IPs based on the server IP 
used. This method is slow in discovering client-LDNS 
pairs due to the limited number of IP addresses a Web 
server can have. In addition, their method is complicated 
to implement, requiring reassignment of server IPs and 
modification of the Web server. 


Compared to both works, the distinguishing features of 
our measurement methodology are efficiency, nonintru- 
siveness, and accuracy. This allowed us to collect more 
extensive data, which we used to evaluate the effective- 
ness of DNS-based server selections using four different 
proximity metrics in several real-world CDN settings. 
To our knowledge, we are the first to conduct such an ex- 
haustive proximity evaluation between clients and their 
local DNS servers using such a representative data set. 
We are also not aware of other work in examining the 
impact that the proximity between the local DNS server 
and the client has on DNS-based server selection in com- 
mercial CDNs. 


There has been a recent effort within the IETF to cat- 
egorize different mechanisms for request routing in 
CDNs [3]. DNS-based redirection is one of those mech- 
anisms, and our methodology may prove useful in eval- 
uating the effectiveness of this technique in that context. 


6 Future work 


There are three areas of future work we plan to pursue. 
First, we plan to study the hidden load factors due to dif- 
fering amounts of HTTP load corresponding to a DNS 
name resolution request from an LDNS server. With the 
help of a busy Web site, we will be able to gather statis- 
tics on the number of HTTP requests and clients behind 
each LDNS server. Identifying LDNS servers resulting 


in large numbers of HTTP requests is essential for proac- 
tive load balancing and flash crowd protection. 


Second, we plan to improve existing DNS-based server 
selection algorithms by considering the properties of 
known client-LDNS associations for an LDNS that re- 
quests a server name resolution. The following charac- 
teristics of the associations can be explored based on 
data collected using our methodology: known client 
proximity to the LDNS, known client distribution, and 
hidden load factor. 


Given a name resolution request from an LDNS, if the 
known client proximity to the LDNS is good, then a 
CDN server close to the LDNS would also be close to its 
clients. If the proximity correlation is low, known client 
distribution and client cluster request patterns would be 
considered. If the majority of HTTP requests belong to a 
single network cluster, finding a CDN server close to or 
within that network cluster would also be close to clients 
issuing a majority of requests. Along with these factors, 
the hidden load factor of the LDNS is also considered to
select lightly loaded CDN servers for an LDNS with a 
large hidden load factor. If the proximity correlation is 
low between LDNS and its clients, then server selection 
is optimized using other metrics such as server load. 


Finally, we would like to apply the results of this work to 
improving content distribution internetworking (CDI), 
which refers to the interoperation among multiple CDNs 
for additional flexibility. A prototype of CDI, called 
CDN Brokering [6], uses a DNS-based brokering mech- 
anism to forward requests among DNS servers of the 
interoperating CDNs. As a third area of future work, 
we plan to improve CDN brokering algorithms by us- 
ing hidden load factors and client-LDNS proximity in- 
formation. The client-LDNS proximity findings in our 
work justify DNS-based brokering, because the major- 
ity of the clients and their LDNS belong to the same AS. 


7 Conclusion 


In this paper, we propose a novel technique for finding 
client and local DNS server associations and potentially 
hidden load factors in a fast, non-intrusive, and accu- 
rate manner. Based on the results, we evaluate the prox- 
imity between clients and their LDNS using four met- 
rics: AS clustering, network clustering, traceroute diver- 
gence, and round-trip time correlation. 


We evaluate the potential effectiveness of DNS-based 
server selection in CDNs based on these metrics. We 
conclude that DNS is good for very coarse-grained
server selection, since 64% of the associations belong to
the same AS. DNS is less useful for finer-grained server
selection, since only 16% of clients use a DNS server in 
the same network-aware cluster. These values can be im- 
proved to 88% and 66% respectively, if clients are con- 
figured to use a closer local DNS server. Since current 
CDNs are not present in many network-aware clusters, 
we conclude that although DNS-based server selection 
has inherent limitations due to potentially poor proxim- 
ity correlation between a client and its LDNS, the impact 
is small due to the sparse distribution of CDN servers in
today’s CDNs. 


At least one CDN has stated a goal of ultimately placing 
CDN servers in every edge network. The high fraction of 
clients using LDNS servers in different network-aware 
clusters suggests that CDNs may be unable to use DNS 
request routing for such fine-grained server selection un- 
less DNS itself scales to provide each edge network with 
a local DNS server that communicates directly with the 
Internet. Thus, from an economic perspective, due to the 
inherent limited precision of DNS-based server selec- 
tion, it is less beneficial to have so many CDN servers 
that the performance to two nearby servers is indistin- 
guishable. 


In addition to the proximity evaluation and the novel 
measurement methodology, our work also provides two 
additional contributions in improving DNS-based CDNs 
in general. From our observation, client-LDNS asso- 
ciations are fairly static. Thus, CDNs can build up a 
database of such associations to infer the geographic lo- 
cation of clients associated with an LDNS IP address 
to improve server selection. Furthermore, based on the 
URL-rewriting technique in our measurement method- 
ology, CDNs can completely eliminate the originator 
problem by embedding the client IP addresses in the 
URLs of the Web pages, when a client initially requests 
the base page.


Abstract


In this paper, we study the geographic properties of In- 
ternet routing. Our work is distinguished from most pre- 
vious studies of Internet routing in that we consider the 
geographic path traversed by packets, not just the net- 
work path. We examine several geographic properties 
including the circuitousness of Internet routes, how mul- 
tiple ISPs along an end-to-end path share the burden of 
routing packets, and the geographic fault tolerance of 
ISP networks. We evaluate these properties using exten- 
sive network measurements gathered from a geograph- 
ically diverse set of probe points. Our analysis shows 
that circuitousness of Internet paths depends on the ge- 
ographic and network locations of the end-hosts, and 
tends to be greater when paths traverse multiple ISPs.
Using geographic information, we quantify the degree 
to which an ISP’s routing policy resembles hot-potato 
or cold-potato routing. We find evidence of certain
tier-1 ISPs exhibiting hot-potato routing. Finally, based on
network topology information gathered at CAIDA, we 
find that many tier-1 ISP networks may have poor toler-
ance to the failure of a single, critical geographic node,
assuming the published topology information is reason- 
ably complete. 


1 Introduction 


The Internet consists of several autonomous systems
(ASes) that are under the control of different admin-
istrative domains. Routing across these administrative
domains is accomplished using the Border Gateway
Protocol (BGP), a protocol for propagating routes be-
tween ASes. ASes connect to each other either at pub-
lic exchanges or at private peering points. The network
path between two end-hosts typically traverses multiple 
ASes. BGP is flexible in allowing each AS to apply its 
own local preferences, and export and import policies 
for route selection and propagation. The characteristics 
of an end-to-end path are very much dependent on the 
policies employed by the intervening ASes. 


Previous work on Internet routing has focused on study- 
ing properties such as end-to-end performance, routing 
stability, and routing convergence that are affected by 
routing policies. There has also been work on strategies
for determining alternate (and hopefully better) routes
by using overlay networks to circumvent the default In-
ternet routing. We discuss previous work in more detail
ternet routing. We discuss previous work in more detail 
in Section 2. 


In this paper, we present a novel way of analyzing certain
properties of Internet routing. We show how geographic 
information can provide insights into the structure and 
functioning of the Internet, including the interactions be- 
tween different autonomous systems. In particular, geo- 
graphic information can be used to quantify well-known 
network properties such as hot-potato routing. It can also 
be used to quantify and substantiate prevalent intuitions 
about Internet routing, such as the relative optimality of 
intra-ISP routing compared to inter-ISP routing. 


To analyze geographic properties of routing, it is neces- 
sary to first determine the geographic path of an IP route. 
The geographic path is obtained by stringing together 
the geographic locations of the nodes (i.e., routers) along
the network path between two hosts. For instance, the 
geographic path from a host in Berkeley to one in Har-
vard may look as follows: Berkeley -> San Francisco ->
New York -> Boston -> Cambridge. The level of detail
in the geographic path would depend on how precisely
we are able to determine the locations of the interme-
diate routers in the path. In Section 3, we describe Geo-
Track [13], a tool we have developed for determining the
geographic path of routes. Our study is based on exten- 
sive traceroute data gathered from 20 hosts distributed 
across the U.S. and Europe and also traceroute data gath- 
ered by Paxson [26] in 1995. 


Internet routes can be highly circuitous. For instance, we 
observed a route from a host in St. Louis to one in Indi- 
ana (328 km away) that traverses a total distance of over 
3500 km (Section 4.2.1). By tracing the geographic path, 
we are able to automatically flag such anomalous routes, 
which would be difficult to do using purely network-
centric information such as delay. We compute the lin-
earized distance between two hosts as the sum of the
geographic lengths of the individual links of the path. 
We then compute the ratio of the linearized distance of 
the path to the geographic distance between the source 
and destination hosts, which we term the distance ra-
tio. A large ratio would be indicative of a circuitous and


acteristics using an overlay network. By actively moni-
toring the quality of different paths, their alternate path
selection mechanism can quickly recover from network 
failures and optimize application specific performance 
metrics. 


Consistent with these findings, our measurements indi- 
cate the existence of highly circuitous paths in the In- 
ternet. We also find that the circuitousness of a path is 
correlated with the minimum end-to-end latency along 
the path. 


2.2 Topology discovery and mapping 


Discovering and analyzing Internet structure has been 
the subject of many studies. Much of the work has fo- 
cused on studying topology purely at the network level, 
without any regard to geography. Recently several tools 
have been developed to map network nodes to their cor- 
responding geographic locations. A few Internet map- 
ping projects have used such tools to incorporate some 
notion of geographic location in their maps. 


The Mercator project [6] focuses on heuristics for In- 
ternet Map Discovery. The basic approach is to use 
traceroute-like TTL limited probe packets coupled with 
source routing to discover routers¹. A key component of
Mercator is the set of heuristics used to resolve aliases, 
i.e., multiple IP addresses corresponding to (possibly 
different interfaces on) a single router. The basic idea is 
to send a UDP packet to a non-existent port on a router 
and wait for the ICMP port unreachable response that it 
elicits. In general, the destination IP address of the UDP 
packet and the source IP address of the ICMP response 
may not match, indicating that the two addresses corre- 
spond to different interfaces on the same router. In our 
work we use geographic information to identify points of
sharing in the network. We view this as complementary
to network-level heuristics such as the ones employed in
Mercator. 


The Internet Mapping Project [2] at Bell Labs also uses 
a traceroute-based approach to map the Internet from a 
single source. The map is colored according to the octets 
of the IP address, so portions corresponding to the same
ISP tend to be colored similarly. The map, however, is 
not laid out according to geography. Other efforts have 
produced topological maps that reflect the geography 
of the Internet. Examples include the Mapnet [24] and
Skitter [28] projects at CAIDA and the commercial Ma- 
trix.Net service [25]. 


A number of tools have been developed for determin- 
ing the geographic location corresponding to an IP ad- 
dress. These tools use a variety of approaches to map 
an IP address to location: inferring location from Whois
records [7] (e.g., NetGeo [11]), extracting location infor-
mation from traceroute data (e.g., GeoTrack [13], Visu-
alRoute [30]), determining the location coordinates us-
ing delay measurements (e.g., GeoPing [13]), etc. Our
previous work on IP2Geo [13] focused on developing
several tools, including GeoTrack, to do IP-to-location
mapping. In this work, we use the GeoTrack tool to an-
alyze geographic properties of Internet routing.


¹ Actually, router interfaces are discovered, not routers.


3 Experimental methodology 


In this section, we discuss our experimental methodol-
ogy. We present the details of our measurement testbed
and the data sets we gathered. We also discuss Geo- 
Track, the tool we used to determine geographic paths 
in the Internet. 


3.1 Overview 


Since the goal of our work is to study the geographic 
properties of Internet routing, much of our measurement 
work has focused on gathering network path data using 
the traceroute tool [8]. We are not interested in study- 
ing the dynamic properties of Internet routing (e.g., how 
routes change over time), so we only record a single 
snapshot of the network path between a given pair of 
hosts. It may be possible that some of the routes in our
dataset are backup paths due to failures at the time of 
our measurement. However, we do not expect the ag- 
gregate statistics reported in this paper to be affected 
by such failures since our measurements were spread 
over a 2-month time period. We use traceroute to de-
termine the network path between 20 traceroute sources
and thousands of geographically distributed destination 
hosts. 


Once we have gathered the traceroute data, we use the 
GeoTrack tool to determine the location of the nodes 
along each network path where possible. GeoTrack re-
ports the location at the granularity of a city. We then
use an on-line latitude-longitude server [18] to compute
the geographic distance between the source and destina-
tion of a traceroute as well as between each pair of adja-
cent routers along the path. The latter enables us to com-
pute the linearized distance, which we define as the sum
of the geographic distances between successive pairs of
routers along the path. So if the path between A and D
passes through B and C, then the linearized distance of
the path from A to D is the sum of the geographic dis-
tances between A & B, B & C, and C & D.
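
As a rough illustration of the linearized distance defined above, the following sketch sums great-circle distances over successive hops. It assumes every router has already been mapped to a (latitude, longitude) pair in degrees. The haversine formula, the Earth-radius constant, and the function names are illustrative choices and are not taken from the measurement setup described here, which relied on an on-line latitude-longitude server.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def great_circle_km(a, b):
    """Haversine distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def linearized_distance_km(path):
    """Sum of the distances between successive hops of a geographic path."""
    return sum(great_circle_km(p, q) for p, q in zip(path, path[1:]))
```

For a path A, B, C, D this adds the A-B, B-C, and C-D legs, so a path that doubles back accumulates more distance than the direct A-D great-circle leg.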


As we discuss in Section 3.4.1, we are typically able to 
determine the location of most but not all routers. We 
simply skip the routers whose locations we are unable to 
determine. So in the above example, if the location of C
is unknown, then we compute the linearized distance of


possibly anomalous route. In Section 4, we study cir-
cuitousness of paths as a function of the geographic and
network locations of the end-hosts. 


Our results indicate that the presence of multiple ISPs in 
a path is an important contributor to circuitous routing. 
We also find intra-ISP routing to be far less circuitous 
than inter-ISP routing. Our study of circuitousness of 
paths provides some insights into the peering and rout- 
ing policies of ISPs. Although circuitousness may not 
always relate to performance, it can often be indicative 
of a routing problem that deserves more careful exami- 
nation. 


There are two extremes to the routing policy that an ISP 
may employ: hot-potato routing and cold-potato routing.
In hot-potato routing, the ISP hands off packets to the 
next ISP as quickly as possible. In cold-potato routing, 
the ISP carries packets on its own network as far as pos- 
sible before handing them off to the next ISP. The for- 
mer policy minimizes the burden on the ISP’s network 
whereas the latter gives the ISP greater control over the 
end-to-end quality of service experienced by the pack- 
ets. As we discuss in Section 5.4, geographic informa- 
tion provides a means to quantify these notions by us- 
ing the geographic distance traversed within an ISP as 
a proxy for the amount of work performed by the ISP. 
In addition, we can also evaluate the degree to which an 
individual ISP contributes in the routing of packets end- 
to-end. Our analysis of properties of paths that traverse 
multiple ISPs is presented in Section 5. 
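
One way to picture the idea of using distance as a proxy for an ISP's work is to compute the share of a path's linearized distance that is carried inside each ISP. The sketch below is only an illustration under stated assumptions: the per-hop ISP labels and link lengths are hypothetical inputs, links that cross between two ISPs are left unattributed, and the function name is invented here. The metric actually used in Section 5.4 may differ.

```python
from collections import defaultdict


def isp_distance_shares(hops, link_km):
    """Fraction of a path's linearized distance carried inside each ISP.

    hops:    ISP label of each router along the path, in order.
    link_km: link_km[i] is the length of the link from hops[i] to hops[i + 1].
    Links whose endpoints belong to different ISPs are not credited to
    either ISP (an assumption made for this sketch).
    """
    if len(link_km) != len(hops) - 1:
        raise ValueError("need exactly one link length per adjacent hop pair")
    total = sum(link_km)
    if total <= 0:
        return {}
    inside = defaultdict(float)
    for (a, b), km in zip(zip(hops, hops[1:]), link_km):
        if a == b:
            inside[a] += km
    return {isp: km / total for isp, km in inside.items()}


# Hypothetical path: ISP X hands the packet to ISP Y almost immediately.
shares = isp_distance_shares(["X", "X", "Y", "Y", "Y"], [50, 10, 900, 1040])
```

In this made-up example ISP X carries 2.5% of the distance and ISP Y carries 97%, which is the kind of imbalance one would expect if X were doing hot-potato routing.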


Another aspect of routing that bears careful examina- 
tion is its fault tolerance. Fault tolerance has generally 
been studied in the context of node or link failures based 
on network-level topology information. However, such 
topology information may be incomplete in that two 
seemingly independent nodes may actually be suscep- 
tible to correlated failures. For instance, a catastrophic 
event such as an earthquake or a major power outage 
might knock out all of an ISP’s routers in a geographic 
region. Geographic information can help in identifying 
routers that are co-located. In order to analyze the im-
pact of correlated failures, we consider ISP topologies
at the geographic level, where each node represents a
geographic region such as a city. Using the geographic
topology information of several commercial ISPs gath-
ered from CAIDA [24], we analyze the fault tolerance
properties of individual topologies and the topology re-
sulting from the combination of the individual ISP net-
works (Section 6). We find that many tier-1 ISPs are
highly susceptible to single geographic node failures. 
The combined topology however exhibits better toler- 
ance to such failures. 
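
To make the notion of a single geographic node failure concrete, the sketch below treats a topology as an undirected graph of cities and reports the cities whose removal disconnects the remaining ones. This is one simple reading of "critical node"; the analysis in Section 6 may use a different measure. The two toy topologies and all function names are hypothetical.

```python
from collections import deque


def connected_without(adj, removed):
    """True if all cities other than `removed` can still reach each other."""
    nodes = [n for n in adj if n != removed]
    if len(nodes) <= 1:
        return True
    seen, queue = {nodes[0]}, deque([nodes[0]])
    while queue:
        for nxt in adj[queue.popleft()]:
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == len(nodes)


def critical_cities(adj):
    """Cities whose loss disconnects the rest of the topology."""
    return sorted(c for c in adj if not connected_without(adj, c))


def merge(*topologies):
    """Union of several city-level topologies (the combined topology)."""
    out = {}
    for adj in topologies:
        for city, nbrs in adj.items():
            out.setdefault(city, set()).update(nbrs)
    return out


# Hypothetical ISP 1: a star centered on Chicago.
isp1 = {
    "Chicago": {"Seattle", "NewYork", "Dallas"},
    "Seattle": {"Chicago"},
    "NewYork": {"Chicago"},
    "Dallas": {"Chicago"},
}
# Hypothetical ISP 2: a ring that avoids Chicago.
isp2 = {
    "Seattle": {"NewYork", "Dallas"},
    "NewYork": {"Seattle", "Dallas"},
    "Dallas": {"NewYork", "Seattle"},
}
```

Here `critical_cities(isp1)` flags Chicago, while the merged topology `merge(isp1, isp2)` has no critical city, mirroring the observation that a combined topology can tolerate failures that an individual ISP cannot.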


In summary, we believe geography is an interesting
means for analyzing and quantifying network properties.
In some cases, our analysis provides additional evidence 
for existing intuition about certain properties of Internet 
routing (e.g., hot-potato routing, circuitous paths). An 
important contribution of our work is a methodology for 
quantifying such intuitions using geographic informa-
tion. Such quantification enables us, for instance, to au-
tomatically flag circuitous paths, something that would
be hard to do using purely network-centric metrics (and no
geographic information). 


2 Related work 


We classify related work into two categories: (a) Internet 
routing; (b) Topology discovery and mapping. 


2.1 Internet routing 


There are several properties of Internet routing that 
are of interest: end-to-end performance, routing stabil-
ity, routing convergence, etc. Previous work on Internet
routing has focused either on measuring these properties
or on modifying certain aspects of routing with a view
to improving performance. Our work shows how geo-
graphic information can be used to measure and quan-
tify certain routing properties such as circuitous routing,
hot-potato routing and geographic fault tolerance. 


Network path information, obtained using the traceroute 
tool [8], has been used widely to study the dynamics of 
Internet routing. For instance, Paxson [14] studied vari- 
ous aspects of Internet routing using an extensive set of 
traceroute data. They include: routing pathologies, sta- 
bility of routing, and routing asymmetry. In relation to 
our work, he studies circuitous routing by determining
the geographic locations of the routers in his dataset and
uses geographic distance as a metric to quantify it. In
addition, he uses the number of different geographic lo-
cations along a path to analyze the effect of hot-potato
routing as a potential cause for routing asymmetry. We
extend this work by studying circuitousness as a function
of the geographic and network location of end-hosts. We
also analyze the effects of multiple ISPs in a path on its
circuitousness. The distance ratio metric that we define
can be used to automatically flag anomalies such as the 
large-scale route fluttering identified in [9, 14]. 


Overlay routing has been proposed as a means to cir- 
cumvent the default IP routing. Savage et al. [17] study 
the effects of the routing protocol and its policies on the 
end-to-end performance as seen by the end-hosts. They 
show that for a large number of paths in the Internet, 
there exist paths that exhibit significantly better perfor- 
mance in terms of latency and packet loss rate. Recently, 
Andersen et al. [1] have proposed specific mechanisms 
for finding alternate paths with better performance char-


the path from A to D as the sum of the geographic dis-
tances between A & B and B & D. Clearly, skipping over
C would lead us to underestimate the linearized distance.
However, as noted in Section 3.4.1, most of the skipped
nodes are in the vicinity of either the source or the
destination, so the error introduced in the linearized dis-
tance computation is small.
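
The effect of skipping an unlocated router can be seen in a small sketch. It uses toy planar coordinates in km rather than real latitudes and longitudes, and marks an unknown location with `None`; both conventions, and the function name, are assumptions made only for this example.

```python
import math


def linearized_distance(hops, dist):
    """Sum dist() over successive hops, skipping hops whose location is None."""
    known = [h for h in hops if h is not None]
    return sum(dist(p, q) for p, q in zip(known, known[1:]))


# Toy planar coordinates in km (hypothetical), just to show the effect.
A, B, C, D = (0, 0), (300, 0), (300, 400), (600, 400)

full = linearized_distance([A, B, C, D], math.dist)        # A-B, B-C, C-D: 1000.0
skipped = linearized_distance([A, B, None, D], math.dist)  # A-B, B-D: 800.0
```

Because the direct B-D leg can never be longer than going from B to D via C (the triangle inequality), dropping C can only shrink the total, never inflate it.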


Figure 1: Locations of our traceroute sources in the
U.S. Note that there were 17 hosts in 15 locations (two 
hosts each in Seattle and Berkeley). 


3.2 Measurement testbed 


We used 20 geographically distributed hosts as the
sources for our traceroutes. 17 of these hosts were lo- 
cated in the U.S. (Figure 1) while 3 were located in Eu- 
rope (at Stockholm (Sweden), Bologna (Italy), and Bu- 
dapest (Hungary)). The geographical diversity in source 
locations enables us to study the variations in routing 
properties as seen from different vantage points. For lo- 
gistical reasons, it was convenient for us to locate the 
traceroute sources on university campuses. 18 out of
the 20 traceroute sources fell into this category. Further-
more, 9 of the 15 university locations we considered in
the U.S. were connected by the Internet2 backbone [19].
To add some diversity, we had one source in Berkeley,
CA connected to a home cable modem network (in addi-
tion to a host at the University of California at Berkeley)
and another in Seattle, WA connected to the Microsoft
Research network (in addition to a host at the University
of Washington at Seattle). These two pairs of sources al-
low us to study (albeit to a limited extent²) what impact,
if any, the nature of the source’s connectivity has.


² We could have used a diverse set of public traceroute
servers [22] to overcome this limitation. However, the large
volume of traceroutes that we were looking to run from each
source precluded this.


The destination set for the traceroutes comprised several
thousand hosts. These destination hosts fell into 4 cate-
gories:


1. UnivHosts: 265 Web servers and other hosts lo-
cated on university campuses in the U.S. The hosts
were distributed across 44 of the 50 states in the
U.S.


2. LibWeb: 1,205 Web servers of public libraries [21]
distributed across 49 states in the U.S. We also en-
sured that the distribution of the geographic loca-
tions of these libraries is not skewed.


3. TVHosts: 3,100 client hosts in the U.S. that con- 
nected to an on-line TV program guide. A majority 
of these clients were located on non-academic net- 
works such as America Online (AOL). 


4. EuroWeb: 1,092 Web servers [23] distributed 
across 25 countries in Europe. 


For ease of exposition, we sometimes refer to Uni-
vHosts, LibWeb, and TVHosts as the U.S. hosts and Eu-
roWeb as the European hosts.


This diverse set of destination hosts enables us to investi-
gate the properties of Internet routing in the context of a
large set of ISPs. In all, we traced approximately 84,000
end-to-end paths between our traceroute sources and the
destination hosts during October-December 2000. Our
data is available online at [27].


3.3 Dataset from 1995 


To study the temporal variations in Internet properties, 
we use the traceroute data set collected by Paxson in
1995 [26]. The data set includes traceroutes conducted
between pairs of hosts drawn from a set of 33 hosts dis-
tributed across (mainly academic sites in) the U.S., Eu-
rope, South Korea, and Australia.


Despite the fact that the 1995 data set contains far fewer
paths than the 2000 data set, it provides an interesting
data point for comparison. The 1995 data set was gath-
ered in late 1995, about 6 months after the demise of the
NSFNET backbone (which used to provide connectivity
to academic sites in the U.S.) and early in the life of the
commercial Internet.


3.4 GeoTrack 


Once we have gathered traceroute data, we use the Geo- 
Track tool, which we developed previously as part of the 
IP2Geo project [13], to translate the network path be-
tween a pair of hosts to the corresponding geographic
path. GeoTrack tries to infer the location of a router
based on its DNS name. Network operators often assign
geographically meaningful names to routers³, presum-
ably for administrative convenience. For example, the
name corerouter1.SanFrancisco.cw.net corresponds to a
router located in San Francisco. However, not all router 
names are recognizable (i.e., some router names may not 
contain an indication of location). 


Here is a brief outline of how GeoTrack works; please 
refer to [13] for a more detailed description. The DNS 
name of the router is parsed to determine if it contains 
any location codes. GeoTrack uses a database of approx-
imately 2000 location codes for cities in the U.S. and in
Europe. Each ISP tends to use its own naming conven-
tion, so there may be multiple codes for each city (e.g.,
chcg, chcgil, cgcil, chi, chicago, ord for Chicago, IL).
GeoTrack incorporates ISP-specific parsing rules that 
specify the subset of valid codes and the position(s) in 
which they may appear in the router names. 
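
The outline above can be sketched in a few lines. This is not GeoTrack's code: the code table is a tiny hypothetical stand-in for the real database of roughly 2000 entries, the rule format (label positions plus a set of valid codes per ISP domain) is an assumed representation, and taking the last two DNS labels as the ISP's domain is a simplification.

```python
# Hypothetical, tiny code table; the real database has roughly 2000 entries.
CITY_CODES = {
    "chi": "Chicago, IL",
    "chicago": "Chicago, IL",
    "ord": "Chicago, IL",
    "chcgil": "Chicago, IL",
    "sanfrancisco": "San Francisco, CA",
}

# Hypothetical ISP rules: domain -> (label positions to inspect, valid codes).
# Positions index the dot-separated labels of the router name, left to right.
ISP_RULES = {
    "cw.net": ((1,), {"sanfrancisco", "chicago"}),
}


def isp_domain(name):
    """Use the last two labels of the DNS name as the ISP's domain."""
    return ".".join(name.lower().split(".")[-2:])


def locate(name):
    """Return a city for a router name, or None if it is not recognizable."""
    labels = name.lower().split(".")
    rule = ISP_RULES.get(isp_domain(name))
    if rule is None:
        return None  # no parsing rule for this ISP in this sketch
    positions, valid = rule
    for i in positions:
        if i < len(labels) and labels[i] in valid:
            return CITY_CODES.get(labels[i])
    return None
```

With these made-up rules, `locate("corerouter1.SanFrancisco.cw.net")` yields San Francisco, while a name such as `cmu.psc.net`, which carries no city code, is reported as unrecognizable.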


We use the domain name of a router to decide which 
ISP it belongs to. While this heuristic works reasonably 
well, it is not perfect because multiple domain names 
may correspond to the same administrative domain (e.g.,
alter.net and uu.net), often due to the merger of what
were once independent networks. For the same reason,
even AS numbers would not enable us to determine the
administrative domain boundaries with complete accu-
racy.


3.4.1 Coverage of GeoTrack


Of the 11,296 .net router names in our traceroute
data set, 7,842 were recognizable (approximately 70%).
We compiled a list of 13 major ISPs with nation-
wide backbones in the U.S. or with international cov-
erage: Sprintlink, AT&T, Cable and Wireless, Internet2,
Verio, BBNPlanet⁴, Qwest, Level3, Exodus, PSINet,
UUNET/Alter.net, vBNS, and Global Crossing. We
found that 5,966 of the 6,859 router names for these ma-
jor ISPs were recognizable (87%). In some individual
cases, such as AT&T and UUNET, the recognizability
was in excess of 95%.


By manual inspection, we found that a large chunk of the
router names which are unrecognizable by our tool have 
no meaningful codes to decipher their locations. Many 
unrecognizable router names tend to be concentrated in 
regional or campus networks. (For example, cmu.psc.net 
is a node in Pittsburgh, PA. However, since it does not 
contain a valid city or airport code, GeoTrack is unable


³ To be precise, DNS names are associated with router inter-
faces, not routers themselves. However, for ease of exposition
we simply use the term router.

⁴ BBNPlanet is now called Genuity, but the router names
are still in the bbnplanet.net domain.


[Plot omitted; axis label: Percentage Routers Recognized in a Path]
Figure 2: The recognizability of router names as a
function of the position of the router in the end-to- 
end path. The position is quantified by dividing the 
number of hops leading up to the router by the total 
number of hops end-to-end. 


to recognize its location.⁵) Figure 2 shows that recogniz-
ability is lowest close to the start and the end of the path.
(The peak corresponding to the very beginning of the 
path is due to the source location always being known.) 
Thus most of the unrecognizable nodes are typically lo- 
cated in the vicinity of the source or the destination, so 
the resulting error in linearized distance is minimal. 


In the case of the 1995 data set, GeoTrack is able to rec- 
ognize 1,289 out of 1,531 router names (approximately 
84%). Interestingly, we noticed a huge difference in the 
naming convention used in 1995 and 2000. Hence we 
needed to create a new set of codes for the 1995 data set. 


3.4.2 Possible inaccuracies 


⁵ Of course, it is possible to include psc and cmu as codes.
However, we refrain from doing so since we only want to in-
clude those codes in GeoTrack that inherently indicate loca-
tion. Doing otherwise would lead us down the path of exhaus-
tive tabulation, which is undesirable.


First, the city codes used in GeoTrack for computing
the location of a router given its label are manually de-
termined and encoded. Hence there is always a possi-
bility that the location of a router as determined by Geo-
Track is incorrect. However, we have greatly reduced the
possibility of such errors by using delay-based verifica-
tion, ISP-specific parsing rules and manual inspection. In
delay-based verification, we perform the following sim-
ple check: if the difference between the minimum RTTs
to two adjacent routers in a path is not high, the distance
between them cannot be large. This simple check helped
us distinguish between two cities named Geneva that had
similar city codes, one in Switzerland and the other in
Texas. We have enumerated specific rules for 52 differ-
ent ISPs (all major ISPs in our data set) which specify
the exact position where a city code is embedded in a
label. This, in conjunction with ISP-specific city codes,
greatly reduces the chances of a wrong location output.
We have also manually inspected the geographic paths
corresponding to a large sample of our traceroute data to
check for any possible errors.
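
The delay-based check described above can be approximated as follows. The text gives no numeric thresholds, so the propagation speed (about two-thirds of the speed of light, roughly 200 km per millisecond in fiber) and the slack term are assumptions chosen purely for illustration, as are the function names. Because forward and return paths can differ, the bound is a heuristic rather than a strict limit.

```python
KM_PER_MS_IN_FIBER = 200.0  # assumed propagation speed, about two-thirds of c


def max_distance_km(rtt_a_ms, rtt_b_ms):
    """Rough upper bound on the distance between two adjacent hops,
    derived from the difference of their minimum RTTs."""
    one_way_ms = abs(rtt_b_ms - rtt_a_ms) / 2.0
    return one_way_ms * KM_PER_MS_IN_FIBER


def plausible(claimed_km, rtt_a_ms, rtt_b_ms, slack_km=100.0):
    """True if a claimed inter-hop distance is consistent with the RTTs.

    slack_km absorbs measurement noise; its value is an assumption.
    """
    return claimed_km <= max_distance_km(rtt_a_ms, rtt_b_ms) + slack_km
```

For instance, a 4 ms difference in minimum RTT bounds the hop-to-hop distance at about 400 km under these assumptions, so a candidate location 8,000 km away (the kind of error a Geneva mix-up would produce) would be rejected, while one 250 km away would pass.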


Second, the linearized distance computed can be dis- 
torted if the geographic locations of many routers in a 
path are unknown. We reduce this distortion by restrict-
ing our analysis to paths that have at least 4 recognizable
intermediate routers. The linearized distance of a path
can also be skewed due to intra-metro distances. Intra-
metro distances will affect our analysis only for small
values of linearized distances. To reduce this skew, we
only consider paths with a linearized distance greater
than 100 km in our study.
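
The two filtering rules just stated (at least 4 recognizable intermediate routers, and a linearized distance above 100 km) amount to a simple predicate. The thresholds come from the text; the input representation, with `None` marking an unrecognized router, and the function name are assumptions for this sketch.

```python
MIN_RECOGNIZED_INTERMEDIATE = 4  # threshold stated in the text
MIN_LINEARIZED_KM = 100.0        # threshold stated in the text


def keep_path(intermediate_locations, linearized_km):
    """Decide whether a path is retained for analysis.

    intermediate_locations: one entry per intermediate router,
    None if its location could not be determined.
    """
    recognized = sum(1 for loc in intermediate_locations if loc is not None)
    return (recognized >= MIN_RECOGNIZED_INTERMEDIATE
            and linearized_km > MIN_LINEARIZED_KM)
```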


3.5 Limitations 


We now discuss the limitations of our study arising both 
due to the inherent limitations of geographic information 
and due to limitations of our experimental methodology. 


1. Geography does not determine performance: 
There is not a perfect relationship between geo-
graphic distance and network performance. It is
possible that a circuitous path yields better perfor-
mance than a less circuitous one. For instance, the
best path between certain countries may
be via the U.S. even if that means a large detour
in geographic terms. However, in Section 4.5, we
show that there exists a strong correlation between
the minimum end-to-end delay between two end-
hosts and the linearized distance of their connect-
ing path. In light of this, we view our geographic
analysis of network paths as providing (a) hints on 
paths that are potentially anomalous and should be 
examined more closely to determine if they are in- 
deed anomalous, (b) an indication of how much im- 
provement there could be in end-to-end latency if a 
non-circuitous path between source and destination 
were feasible, and (c) a way to quantify network 
properties such as hot-potato routing, which may 
provide new insight into these properties. 


2. IP-level topology is incomplete: Our linearized 
distance computation only considers the router- 
level (i.e., IP-level) topology. We have no way 
of discovering the underlying physical topology
(which may be based on ATM, SONET, or other
technologies), so in general we would underes-
timate the linearized distance. While this is a
limitation of our methodology, we note that the
trend in high-speed networks (OC-48 and faster)
is away from separate layer-2 and layer-3 architec-
tures (e.g., IP-over-ATM) and towards an all-IP net-
work [15]. This trend increases the applicability of
our methodology. 


4 Circuitousness of Internet paths 


In this section, we examine the nature of circuitous
routes in the Internet. Since there is no standard
measure of circuitousness, we define a metric, the
distance ratio, as the ratio of the linearized distance
of a path to the geographic distance between the source
and destination of the path. The distance ratio reflects
the degree to which the network path between two nodes
deviates from the direct geographic path between the
nodes. A ratio of 1 would indicate a perfect match
(i.e., an absolutely direct route) while a large ratio
would indicate a circuitous path.
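
To make the metric concrete, the following minimal Python
sketch computes the linearized distance and the distance
ratio of a path. It assumes that each recognizable router
has already been mapped to a (latitude, longitude) pair,
that the linearized distance is the sum of the great-circle
distances between successive locations, and that the Earth
is a sphere of radius 6371 km. The function names and the
sample coordinates are illustrative only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: spherical Earth with mean radius


def great_circle_km(a, b):
    """Haversine distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def linearized_distance_km(hops):
    """Sum of the distances between successive located hops."""
    return sum(great_circle_km(p, q) for p, q in zip(hops, hops[1:]))


def distance_ratio(hops):
    """Linearized distance divided by the source-to-destination distance."""
    return linearized_distance_km(hops) / great_circle_km(hops[0], hops[-1])


# Hypothetical paths: one along the equator, one with a detour to the north.
direct = [(0.0, 0.0), (0.0, 5.0), (0.0, 10.0)]
detour = [(0.0, 0.0), (10.0, 5.0), (0.0, 10.0)]
print(round(distance_ratio(direct), 3))
print(round(distance_ratio(detour), 3))
```

The first path follows the direct route, so its ratio is
essentially 1; the second has the same end points but a
ratio above 2 because of the detour.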


We present several different analyses with a view to
studying the impact of spatial as well as temporal
factors. Under spatial factors, we study the effect of
the geographic and network locations of end-hosts on
the circuitousness of paths. To study temporal
properties, we compare the circuitousness of paths drawn
from Paxson’s 1995 data set to that of paths drawn from
our 2000 data set. Finally, we analyze the relationship
between the minimum delay between two end-hosts and the
linearized distance along their path.


4.1 Effect of network location 


In this section, we vary the network location of the
end-hosts (source and destination) and study its effect
on the distance ratio of paths. In our first analysis,
we fix a source and compare the distance ratio of paths
to destinations in different networks. In our second
analysis, we compare the distance ratio of paths from
different sources in the same geographic location but
with different network connectivity to a set of
end-hosts in the same network.


4.1.1 Paths from a single source


We consider paths from our traceroute sources in U.S.
universities to two varied sets of end-hosts: UnivHosts
and TVHosts. Many of the hosts in UnivHosts (including
our sources) connect to the Internet2 high-speed
backbone via a local GigaPOP. So much of the wide-area
path between our sources and a host in UnivHosts
traverses the Internet2 backbone. On the other hand,
TVHosts is a more diverse set that includes hosts
located in various commercial networks (AOL, MSN,
@Home, etc.) as well as university campuses. So the
wide-area paths from our sources to the hosts in
TVHosts typically traverse one or more commercial ISP
backbones.


Figure 3: CDF of distance ratio for paths from UC
Berkeley to UnivHosts and TVHosts.


This difference between the two groups of destination
hosts is reflected in the cumulative distribution
function (CDF) of the distance ratio for the two cases.
As Figure 3 shows (for the source at UC Berkeley), the
distance ratio is close to 1 for many of the
destinations. The ratio is 1.1 or less (corresponding
to a linearized distance that exceeds the end-to-end
geographic distance by no more than 10%) for 55% of the
destinations in UnivHosts and 45% in TVHosts. This
finding is consistent with the rich Internet
connectivity of the San Francisco Bay Area (where UC
Berkeley is located). The area includes several public
Internet exchanges (e.g., MAE-West and PAIX) as well as
private peering points. So a path from the UC Berkeley
host to a destination host is often (but not always)
able to transition to the latter's ISP within the Bay
Area itself, and there is little need to take a detour
through another city just to transition to the
destination’s ISP.


There is a far more pronounced difference between the
UnivHosts and TVHosts cases if we look at the tail of
the distribution. For instance, at the 90th percentile
mark, the distance ratio is 1.41 in the case of
UnivHosts but 1.72 in the case of TVHosts: in other
words, the detour is 1.75 times as large for TVHosts
destinations as it is for UnivHosts (72% versus 41%).
The paths to some of the hosts in TVHosts tend to be
more circuitous because they traverse multiple
commercial ISPs whose peering relationships may cause
detours in the end-to-end path. We discuss this issue
in more detail in Section 5. We observe qualitatively
the same trends for other university sources as well;
i.e., the distance ratio tends to be smaller for paths
leading to UnivHosts compared to TVHosts.
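
The CDF readings quoted above can be reproduced from a list
of distance ratios with a few lines of Python. This is only
a sketch: the nearest-rank percentile definition is an
assumption (other definitions interpolate), and the sample
ratios are made up rather than taken from our data sets.

```python
import math


def cdf_at(values, x):
    """Fraction of values that are less than or equal to x."""
    return sum(v <= x for v in values) / len(values)


def percentile(values, p):
    """Nearest-rank percentile: smallest value with >= p% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def detour_fraction(ratio):
    """Extra distance relative to the direct path, e.g. 1.41 -> about 0.41."""
    return ratio - 1.0


# Made-up distance ratios for ten paths (not measured data).
ratios = [1.02, 1.05, 1.08, 1.10, 1.15, 1.22, 1.30, 1.41, 1.60, 2.10]
print(cdf_at(ratios, 1.1))     # share of paths with a ratio of 1.1 or less
print(percentile(ratios, 90))  # 90th-percentile distance ratio
# Relative size of the two detours quoted in the text (72% versus 41%).
print(round(detour_fraction(1.72) / detour_fraction(1.41), 2))
```

The last line shows where the "1.75 times" figure comes
from: it compares the detours (ratio minus 1), not the
ratios themselves.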


4.1.2 Multiple sources in the same location 


We now consider paths from pairs of hosts in the same 
location but on entirely different networks to destina- 
tions in the UnivHosts set. We consider two such pairs of 
traceroute sources: (a) a machine on the Berkeley
campus and another also in Berkeley but on @Home’s cable
modem network, and (b) a machine at the University of
Washington (UW) campus in Seattle and another on the
Microsoft Research network 10 km away.


Figure 4: CDF of distance ratio for paths from pairs
of co-located sources to UnivHosts.


Figure 4 shows the CDF of the distance ratio for all 4
sources. For the two sources located in Berkeley, we
find that the one on the university campus has a
significantly smaller distance ratio, especially at the
tail of the distribution. For instance, the 90th
percentile of the distance ratio for the UC Berkeley
source is 1.41 while that for the cable modem source is
1.83. Since the destination set is UnivHosts, the UC
Berkeley source tends to have more direct routes (via
Internet2) than the cable modem client has (via @Home
and other commercial ISPs).


We observe a similar trend for the UW-Microsoft pair.
The UW source has more direct routes to other
university hosts than does the Microsoft source. For
instance, the path from Microsoft to the University of
Chicago follows a highly circuitous route through
BBNPlanet’s (Genuity) network. The geographic path
traversed includes Los Angeles, Carlton (TX),
Indianapolis and Chicago (in that order). The
linearized distance of the path is 4976 km while the
geographic distance between Seattle and Chicago is only
2795 km. In contrast, the path from UW (via Internet2)
is far more direct: it passes through Denver, Kansas
City, Indianapolis, and finally Chicago, for a total
linearized distance of 3533 km.
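
As a quick worked example, the distance ratios of these two
paths follow directly from the distances quoted above. The
short Python sketch below only restates that arithmetic;
the dictionary keys are labels chosen for illustration.

```python
GEOGRAPHIC_KM = 2795  # Seattle to Chicago, as quoted in the text

# Linearized distances quoted in the text for the two paths.
paths_km = {"Microsoft via BBNPlanet": 4976, "UW via Internet2": 3533}

ratios = {name: km / GEOGRAPHIC_KM for name, km in paths_km.items()}
for name, km in paths_km.items():
    detour_km = km - GEOGRAPHIC_KM  # extra distance beyond the direct route
    print(f"{name}: ratio {ratios[name]:.2f}, detour {detour_km} km")
```

The Microsoft path has a distance ratio of about 1.78,
against about 1.26 for the UW path.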


These results indicate that the nature of network connec- 
tivity of the source and the destination has a significant 
impact on how direct or circuitous the network paths are.


4.2 Effect of geographic location


The geographic location of a source indirectly de- 
termines its network connectivity. Sources near well- 
connected geographic locations like the Bay Area can 
potentially have less circuitous routes since many com- 
mercial ISPs will have a POP very close to them. To 
better understand the effect of geographic location, we 
compare the distance ratios of sources in different lo- 
cations to a common set of destination end-hosts. We 
extend this analysis to study the role of network
structures in different continents (U.S. and Europe) on
the circuitousness of paths.


4.2.1 Multiple sources in different locations 


We consider paths from sources in three geographically 
distributed locations in the U.S.: Stanford, Washington 
University at St. Louis (WUSTL), and the University of 
North Carolina (UNC). The destination set is LibWeb,
which is a larger and more diverse set than the
UnivHosts set considered in Section 4.1.2.


Figure 5: CDF of distance ratio for paths from
multiple sources to LibWeb.


As shown in Figure 5, the distance ratio tends to be 
the smallest for paths originating from Stanford and the 
largest for those originating from WUSTL. Stanford, 
like Berkeley, is located in the San Francisco Bay area, 
which is well served by many of the large ISPs with na- 
tionwide backbones. In contrast, WUSTL is much less 
well connected. Almost all paths from WUSTL enter 
Verio’s network in St. Louis and then take a detour
either to Chicago in the north or Dallas in the south.
At one of these cities, the path transitions to another
major ISP such as AT&T or Cable & Wireless and proceeds
to the destination. Any detour is particularly
expensive in terms of the distance ratio because the
central location of St. Louis in the U.S. means that
the geographic distance to various destinations is
relatively small.


In general, paths (such as those from WUSTL) that
traverse significant distances in the backbones of two
or more large ISPs tend to be more circuitous than paths
(such as those from Stanford) that traverse much of the 
end-to-end distance in the backbone of a single ISP (re- 
gardless of who the ISP is). One example of a highly 
circuitous path we found involved two large ISPs, Verio 
and AT&T. The path originates in WUSTL in St. Louis 
and terminates at a host in Indiana University, 328 km 
away. However, the geographic path goes from St. Louis 
to New York via Chicago, all on Verio’s network. In New 
York, it transitions to AT&T’s network and then retraces 
its path back through Chicago to St. Louis, before finally 
heading to Indiana. The linearized distance is 3500 km, 
more than 10 times as much as the geographic distance.
We examine the impact of multiple ISPs in greater detail 
in Section 5. 


While the specific findings pertaining to Stanford and 
WUSTL may not be important in general, our results 
suggest that the distribution of the distance ratio is con- 
sistent with our intuition about the richness of connec- 
tivity of hosts in different geographic locations. 


4.2.2 U.S. versus Europe 


We now analyze the distance ratios for paths in Europe 
and compare these to the distance ratios for paths in 
the U.S. We consider paths from the 17 U.S. sources 
to destinations in the LibWeb set and also paths from 
the 3 European sources to destinations in the EuroWeb 
set. Thus, all of these paths are contained either entirely 
within the U.S. or entirely within Europe. We do not con- 
sider paths from U.S. sources to European destinations 
(or vice versa) because the distance ratio for such paths 
tends to be dominated by long transatlantic links (which
tend to push the ratio towards 1).


Figure 6: CDF of distance ratio for paths within the
U.S. and those within Europe.


In Figure 6, we show the distribution of the distance
ratio for three sources: Berkeley in the U.S., and
Stockholm (Sweden) and Bologna (Italy) in Europe. We
observe that the distance ratio tends to be larger for
the European sources compared to Berkeley, especially
in the tail of the distribution. We attribute this to
three causes.


First, paths in Europe tend to traverse multiple
regional or national ISPs. The complex peering
relationships between these ISPs often result in
convoluted paths. For instance, a path from Bologna to
a host in Salzburg, Austria traverses 3 ISPs (GARR, the
Italian Academic and Research Network; Eqip/Infonet;
and KPNQwest, a leading pan-European ISP based in the
Netherlands) and passes through Milan (Italy), Geneva
(Switzerland), Paris (France), Amsterdam (Netherlands),
Frankfurt (Germany), and Vienna (Austria). The
linearized distance of the path is 2506 km whereas the
geographic distance between Bologna and Salzburg is
only 383 km.


Second, in some cases the path from a European source
to a European destination passes through nodes in the
U.S. For instance, a path from Stockholm (Sweden) to
Zagreb (Croatia) passes through a node in New York
City belonging to Teleglobe, a large international ISP.
If the ISPs in Europe have better connectivity to ISPs
in the U.S., it would be appropriate for them to route
their traffic through the U.S. even though the route
may be more circuitous.


Third, geographic distances in Europe tend to be
smaller than those in the U.S. As in the case of
St. Louis in Section 4.2.1, small detours in routing
can be particularly expensive in terms of the distance
ratio for paths between end-hosts in Europe.


4.3 Temporal properties of routing


To better understand some of the temporal properties of 
routing, we compare the distribution of the distance ra- 
tio computed from our 2000 data set with that computed 
from Paxson’s 1995 data set [20]. The paths in the 1995 
data set correspond to traceroutes conducted amongst 
the 33 nodes (mainly at academic locations) that were 
part of the testbed. We considered 340 paths between 
the subset of 20 nodes that were located in the U.S. The 
1995 data set includes multiple traceroute measurements 
between each pair of hosts. In our study, we only use 
data from one successful traceroute between each pair
of hosts. To keep the nature of the measurement points 
similar, in the 2000 data set we only consider paths be- 
tween the 15 source hosts located at universities and the 
265 hosts in the UnivHosts set. 


Figure 7 plots the CDF of the distance ratio for the 1995 
and 2000 data sets. By observing the tail of the cumu- 
lative distribution, we find that the distance ratios tend 
to be smaller in the 2000 data set. This improvement is 
not surprising because the Internet is more richly
connected today than it was 5 years ago. There now exist
direct point-to-point links between locations that were 
previously connected only by an indirect path.


Figure 7: CDF of distance ratio for paths in Paxson’s
1995 data set and our data set from 2000.


4.4 Correlation between delay and distance


Finally, we analyze the relationship between geography
and the end-to-end delay along a path. Geography by
itself cannot provide any information about many
performance characteristics, such as bandwidth or
congestion along a path. However, the linearized
distance of a path does impose a minimum delay along
the path, namely its propagation delay.
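
The lower bound implied by propagation delay can be
sketched as follows. The sketch assumes that signals travel
through fiber at roughly two thirds of the speed of light in
vacuum and that the reverse path has the same linearized
distance as the forward path; neither assumption is
something our measurements verify, and the function name is
illustrative.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458
FIBER_FACTOR = 2 / 3  # assumed ratio of signal speed in fiber to c


def min_rtt_ms(linearized_km, fiber_factor=FIBER_FACTOR):
    """Lower bound on the RTT in ms, assuming a symmetric reverse path."""
    one_way_s = linearized_km / (SPEED_OF_LIGHT_KM_PER_S * fiber_factor)
    return 2 * one_way_s * 1000


# Distances quoted in Section 4.1.2 for the Seattle-Chicago example.
for km in (2795, 3533, 4976):
    print(km, round(min_rtt_ms(km), 1))
```

Under these assumptions every 1000 km of linearized
distance adds about 10 ms to the minimum RTT, so a measured
minimum RTT cannot fall below the value implied by the
linearized distance of the path.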


To study this correlation, we use the TVHosts data set
since it represents a wide variety of end-hosts. In our 
traceroute data, we obtain 3 RTT samples for every 
router along the path. Since not all routers in a path 
are recognizable, we consider the minimum RTT,
geographic distance and linearized distance to the last
recognizable router along the path. In this analysis,
we restrict ourselves to the list of probes in the U.S.


Figure 8: CDF of minimum end-to-end RTT to
TVHosts for different ranges of linearized distances
and geographic distances of paths.


Figure 8 illustrates the correlation of the minimum RTT
along a path to the linearized distance of the path and
the geographic distance between the end-hosts. We make
three important observations. First, there exists a
strong correlation between the delay and the linearized
distance for a large fraction of end-hosts, especially
for small values of linearized distance. We expect this
correlation to be much stronger if we compute the
minimum over a larger number of samples. Second, the
linearized distance along a path does impose a minimum
end-to-end RTT, which is an important performance
metric for latency-sensitive applications. Third, the
minimum RTT between two end-hosts is less strongly
correlated with the geographic distance between them
than with the linearized distance of the path
connecting them. We observe that for a given range of
linearized distance of a path, the RTT variation is
much smaller than its variation for the same range of
geographic distance between the end-hosts. Hence the
linearized distance of a path conveys more about the
minimum RTT characteristics of a path than the
geographic distance between the end-hosts alone. We
also verified that these observations hold across the
other data sets we collected. The coarse correlation
between minimum delay and geographic distance was used
in building GeoPing, an IP-to-location mapping service
[13].


4.5 Summary of Results 


From Sections 4.1 and 4.2, we observe that the
circuitousness of a route depends on both the geographic
and network location of the end-hosts. In many cases,
the trends we observe in the distance ratio are
consistent with our intuition. A large value of the
distance ratio enables us to automatically flag paths
that are highly circuitous, possibly (though not
necessarily) because of routing anomalies. Finally, we
show that the minimum delay between end-hosts and the
linearized distance of their path are strongly
correlated. This relationship indicates that the
circuitousness of a route does have an effect on the
delay observed along the route (though this does not
completely dictate the performance along the route).


5 Impact of multiple ISPs 


Our analysis in Section 4 focused on the
characteristics of the end-to-end path from a source to
a destination. The end-to-end path typically traverses
multiple autonomous systems (ASes). Some of the ASes are
stub networks such as university or corporate networks
(where the source and destination nodes may be located)
whereas others are ISP networks. The relationships
between these networks are often complex. There are
customer-provider relationships (such as those between
a university network and its ISP or between a regional
ISP and a nationwide ISP) and peering relationships
(such as those between two nationwide ISPs). A stub
network may be multi-homed (i.e., be connected to
multiple providers). Two nationwide ISPs may peer with
each other at multiple locations (e.g., San Francisco
and New York).


These complex interconnections between the individual 
networks have an impact on end-to-end routing. In this
section, we show that geography can indeed be used 
as a means to analyze these complex interconnections. 
Specifically, we investigate the following questions: (a) 
are Internet paths within individual ISP networks as cir- 
cuitous as end-to-end paths?, (b) what impact does the 
presence of multiple ISPs have on the circuitousness of 
the end-to-end path?, (c) what is the distribution of the 
path length within individual ISP networks, and (d) can 
geography shed light on the issue of hot-potato versus 
cold-potato routing? 


5.1 Circuitousness of end-to-end paths versus intra-ISP paths


Figure 9: CDF of distance ratio of end-to-end paths
versus that of sections of the path that lie within
individual ISP networks.


We now take a closer look at the circuitousness of end- 
to-end Internet paths, as quantified by the distance ratio. 
We compare the distance ratio of end-to-end paths with 
that of sections of the path that lie within individual
ISP networks. We consider paths from the U.S. sources
to the LibWeb data set for this analysis.


As shown in Figure 9, the distance ratio of end-to-end
paths tends to be significantly larger than that of
intra-ISP paths. In other words, end-to-end paths tend
to be more circuitous than intra-ISP paths. Furthermore,
the distribution of the ratio tends to vary from one
ISP to another, with Internet2 doing much better than
the average and Alter.Net (part of UUNET) doing worse.


We believe the reason that end-to-end paths tend to be
more circuitous is that the peering relationships
between ISPs may create detours that would otherwise
not be present. Inter-domain routing in the Internet
largely uses the BGP [16] protocol. BGP is a path
vector protocol that operates at the level of ASes. It
offers limited visibility into the internal structure
of an AS (such as an ISP network). So the actual cost
of an AS-hop (in terms of latency, distance, etc.) is
largely hidden at the BGP level. As a result, the
end-to-end path may include large detours.


Another issue is that ISPs typically employ BGP policies
to control how they exchange traffic with other ISPs
(i.e., which traffic enters or leaves their network and
at which ingress/egress points). The control knobs made
available by BGP include import policies, such as
assigning a local preference to indicate how favorable
a path is, and export policies, such as assigning a
multi-exit discriminator (MED) to control how traffic
enters the ISP network [5]. These policies are often
influenced by business considerations. For instance,
packets from a customer of ISP A to a customer of ISP B
in the same city might have to go via a peering point
in a different city simply because a local service
provider in the origin city who peers with both ISP A
and ISP B does not provide transit service between the
two ISPs.


Such BGP policies may partly explain the example men- 
tioned in Section 4.2.1, where packets from a host in St. 
Louis to a nearby location had to travel on Verio’s net- 
work all the way to New York to enter AT&T’s network. 
We have seen several other such examples: a path from 
Austin, TX to Memphis, TN where the transition from 
Qwest to Sprintlink happens in San Jose, CA; a path 
from Madison, WI to St. Louis, MO where the transition
from BBNPlanet to Qwest happens in Washington DC. We do
not have specific information on the policies that were
employed by these ISPs, so we cannot make a definitive
claim that BGP is to blame. However, in view of the
complex policies that come into play in the context of
inter-domain routing, it is not surprising that
end-to-end paths tend to be more circuitous.


In contrast, routing within an ISP network is much more 
controlled. Typically, a link-state routing protocol, such 
as OSPF [12], is used for intra-domain routing. Since the 
internal topology of the ISP network is usually known
to all of its routers, routing within the ISP network
tends to be close to optimal. So the section of an
end-to-end path that lies within the ISP’s network
tends to be less circuitous. Referring again to the
example in Section 4.2.1, both the St. Louis → Chicago
→ New York path within Verio’s network and the New York
→ Chicago → St. Louis path within AT&T’s network are
much less circuitous than the end-to-end path.


However, this does not mean that intra-ISP paths are
never circuitous. As noted in Section 4.1.2, we found a
circuitous path through BBNPlanet (Genuity), from
Microsoft Research in Seattle to the University of
Chicago, that has a linearized distance of 4976 km
whereas the geographic distance is only 2795 km. This
does not imply that the path is necessarily sub-optimal.
In fact, the circuitous path may be best from the
viewpoint of network load and congestion. The point is
that while geography provides useful insights into the
(non-)optimality of network paths, it only presents
part of the picture.


5.1.1 Impact of path length on circuitousness 


Figure 10: Distance ratio versus the geographic
distance between the ends of a path. The median distance
ratio is computed over 400 km buckets (0-400 km, 400-800
km, and so on). A minimum distance threshold of 100 km
is imposed to prevent the ratio from blowing up, so the
first bucket is actually 100-400 km.


One question that arises from the above analysis is
whether there is a connection between the circuitousness
of a path and its length (i.e., the geographic distance
between the two ends of the path). In other words, are
longer paths inherently more circuitous, regardless of
whether they traverse one ISP or many? If so, the fact
that end-to-end paths tend to be longer than intra-ISP
paths may explain the greater circuitousness of the
former.


However, as shown in Figure 10, the trend is quite the
opposite. The distance ratio tends to decrease as the
geographic distance increases.* The reason is that the
impact of a detour is smaller (in relative terms) in
the context of a longer path. The distance ratio for
the end-to-end path tends to be greater than that for
the intra-ISP path, regardless of geographic distance.
Thus the greater circuitousness of end-to-end paths is
most likely due to the presence of multiple ISP
networks in the path.


* The jaggedness of the curves arises because of the
large variance in distance ratio for small values of
geographic distance. The 5th and 95th percentile marks
for the 100-400 km bucket are (1.00, 20.50) for the
end-to-end case and (1.00, 4.22) for the intra-ISP
case. The corresponding marks for the 4000-4400 km
bucket are (1.01, 1.57) for the end-to-end case and
(1.00, 1.18) for the intra-ISP case.
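
The bucketing behind Figure 10 can be sketched in Python as
follows. The sketch assumes that the 100 km threshold is
applied to the geographic distance and that each path is
given as a (geographic km, linearized km) pair; the function
name and the sample values are hypothetical.

```python
from collections import defaultdict
from statistics import median

BUCKET_KM = 400        # bucket width from the Figure 10 caption
MIN_DISTANCE_KM = 100  # threshold from the Figure 10 caption


def median_ratio_by_bucket(paths):
    """Map bucket start (km) -> median distance ratio.

    `paths` is an iterable of (geographic_km, linearized_km) pairs.
    """
    buckets = defaultdict(list)
    for geographic_km, linearized_km in paths:
        if geographic_km < MIN_DISTANCE_KM:
            continue  # the ratio blows up for very short paths
        start = int(geographic_km // BUCKET_KM) * BUCKET_KM
        buckets[start].append(linearized_km / geographic_km)
    return {start: median(r) for start, r in sorted(buckets.items())}


# Made-up (geographic, linearized) distances in km, not measured data.
sample = [(50, 400), (200, 300), (300, 900), (350, 420),
          (900, 1080), (1100, 1210)]
print(median_ratio_by_bucket(sample))
```

In this made-up sample the 50 km path is discarded, and the
median ratio is lower in the 800-1200 km bucket than in the
first bucket, which mirrors the trend described above.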


5.2 Impact of multiple ISPs on circuitousness


Figure 11: CDF of the fraction of the end-to-end path
that lies within the top 2 ISPs in the case of
circuitous paths and non-circuitous paths.


In Section 5.1 we hypothesized that the presence of
multiple ISPs in an end-to-end path contributes to the
circuitousness of the path. We now examine this issue
more carefully. We classify end-to-end paths into two
categories: non-circuitous (distance ratio < 1.5) and
circuitous (distance ratio > 2).* For each path in
either category, we identify the top two ISPs that
account for most of the end-to-end linearized distance.
We then compute the fraction of the end-to-end
linearized distance that is accounted for by each of
the top two ISPs, and denote these fractions by max1
and max2. For example, if an end-to-end path with a
linearized distance of 1000 km traverses 400 km in
AT&T’s network and 300 km in UUNET’s network (and
smaller distances in other networks), then max1 = 0.4
and max2 = 0.3. Note that it is possible for max1 to be
1.0 (and so max2 to be 0.0) if the entire end-to-end
path traverses just one ISP network. We note that
local-area networks confined to a city (e.g., a
university network) contribute nil to the linearized
distance and are therefore ignored.
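
The classification and the two fractions can be sketched in
Python as follows. The sketch assumes that the linearized
distance within each ISP has already been computed; the
function names and the "ISP-C" and "ISP-D" entries are
hypothetical additions to the example from the text.

```python
def classify(distance_ratio):
    """Apply the thresholds from the text; ratios in between get no label."""
    if distance_ratio < 1.5:
        return "non-circuitous"
    if distance_ratio > 2:
        return "circuitous"
    return None


def top_two_fractions(isp_km, total_km):
    """Return (max1, max2): shares of the path within the two largest ISPs."""
    shares = sorted((km / total_km for km in isp_km.values()), reverse=True)
    shares += [0.0, 0.0]  # pad so that paths with fewer than two ISPs work
    return shares[0], shares[1]


# The 1000 km example from the text, plus hypothetical smaller ISPs.
segments = {"AT&T": 400, "UUNET": 300, "ISP-C": 200, "ISP-D": 100}
print(top_two_fractions(segments, 1000))
print(top_two_fractions({"OnlyISP": 1000}, 1000))
print(classify(1.2), classify(2.5), classify(1.7))
```

The first call gives the values from the text (0.4 and
0.3), and the second gives the single-ISP case (1.0 and
0.0).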


Figure 11 shows the CDF of max1 and max2 for the
circuitous and non-circuitous paths. The difference in
the characteristics of these two categories of paths is
striking. The max1 and max2 curves are much closer
together in the case of circuitous paths than in the
case of non-circuitous paths. In other words, in the
case of circuitous paths, the end-to-end path traverses
substantial distances in each of the top two ISPs (and
perhaps other ISPs too). In contrast, non-circuitous
paths tend to be dominated by a single ISP. For
instance, the median values of max1 and max2 in the
case of circuitous paths are approximately 0.65 and
0.3, respectively. In other words, the top two ISPs
account for 65% and 30%, respectively, of the
end-to-end path in the median case. However, the
fractions for the non-circuitous paths are
approximately 95% and 4%, respectively, which is much
more skewed in favor of the top ISP.


* While the choice of these thresholds is arbitrary,
they capture the intuitive notion of circuitous and
non-circuitous routes. Note that there may be paths
that do not fall into either category.


Figure 12: CDF of the distance ratio as a function of
the number of major ISPs traversed along an end-to-end
path. There were few paths that traversed more than 3
major ISPs.


We also consider the impact of the number of major ISPs
traversed along an end-to-end path on the distance ratio. 
Figure 12 shows a clear trend: the distance ratio tends to 
increase as the path traverses a greater number of ISPs. 
For instance, the median distance ratios are 1.18, 1.25, 
and 1.38, respectively with 1, 2, and 3 major ISPs. The 
90th percentile of the distance ratio is 1.81, 2.26, and 
2.35, respectively. A path that traverses a larger num- 
ber of major ISPs may span a greater distance. How- 
ever, as noted in Section 5.1.1, this would not explain 
the larger distance ratio. In fact, a greater geographic
distance would tend to make the distance ratio smaller,
not larger.


These findings reinforce our hypothesis that there is 
a correlation between the circuitousness of a path (as 
quantified by the distance ratio) and the presence or ab- 
sence of multiple ISPs that account for substantial por- 
tions of the path. 


5.3 Distribution of ISP path lengths 


In this section, we further examine the distribution of the 
end-to-end linearized distance that is accounted for by 
individual ISPs. We wish to understand how the effort
of carrying traffic end-to-end over a wide-area path is
apportioned between different ISPs. For each of the 13
nationwide ISPs in the U.S. listed in Section 3.4.1, we 
consider the set of paths that traverse one or more nodes 
in that ISP’s network. For each such path, we compute 
the fraction of the end-to-end path that lies within the 
ISP’s network. 


Figure 13: CDF of the fraction of the end-to-end path
that lies within individual ISP networks.


Figure 13 plots the CDF of this fraction for a few ISPs. In 
each case, we consider the paths from the U.S. university 
sources to the LibWeb data set. We observe that the dis- 
tributions look very different. For instance, the median 
fraction of the end-to-end path that lies within Sprintlink 
is only about 0.35 whereas the corresponding fraction 
for UUNET is 0.75 and for Internet2 is over 0.9. Internet2
is a high-speed backbone network that connects many 
university campuses in the U.S. An end-to-end path that 
traverses Internet2 typically originates and terminates at 
university campuses. Therefore, the Internet2 backbone 
accounts for an overwhelming fraction of such end-to- 
end paths. UUNET accounts for a larger fraction of the 
paths that traverse its backbone than any other commer- 
cial ISP we considered. This may reflect the close rela- 
tionship between UUNET’s parent company, Worldcom 
(which runs the vBNS backbone [29]), and academic 
sites. 


The much smaller fraction in the case of Sprintlink is 
harder to explain definitively. From our conversations 
with people at Sprint [3, 10], we have learned that aca- 
demic sites are not their major customers, so Sprintlink 
participates minimally in carrying academic traffic. The 
location of our traceroute sources at academic sites may 
explain why Sprintlink only accounts for a small fraction
of the end-to-end path. 


We stress, however, that the point of our analysis is not 
to make general claims about certain ISPs being better or 
worse than others. Rather it is to show that geographic 
analysis of end-to-end paths yields interesting insights
into the role played by multiple ISPs in specific contexts
(e.g., academic sites) and that these insights are consis- 
tent with our intuition. 


5.4 Hot-potato versus Cold-potato routing 


Finally, we investigate whether geographic information 
can be helpful in assessing whether ISP routing policies 
in the Internet conform to either hot-potato routing or
cold-potato routing. In hot-potato routing, an ISP hands 
off traffic to a downstream ISP as quickly as it can. 
Cold-potato routing is the opposite of hot-potato rout- 
ing where an ISP carries traffic as far as possible on its 
own network before handing it off to a downstream ISP. 
These two policies reflect different priorities for the ISP. 
In the hot-potato case, the goal is to get rid of traffic as 
soon as possible so as to minimize the amount of work 
that the ISP’s network needs to do. In the cold-potato 
case, the goal is to carry traffic on the ISP’s network to the
extent possible so as to maximize the control that the 
ISP has on the end-to-end quality of service. In general, 
an ISP’s routing policy would lie somewhere in between 
the extremes of hot-potato and cold-potato routing. 


[Figure 14 plot. Title: "First ISP vs Second ISP"; axis label: "Cumulative Distribution"]





Figure 14: CDF of the fraction of the end-to-end path 
that lies within the first and second ISP networks in 
sequence. 


We consider the set of paths from U.S. sources to 
TVHosts. For each path that traverses two or more major 
ISPs (with nationwide backbones), we compute the fraction
of the end-to-end path that lies within the first major
ISP (ISP1) and the second major ISP (ISP2) in sequence.
We use these fractions as measures of the amount of
work that these ISPs do in conveying packets end-to-end.
The distributions of these fractions are plotted in Figure 14.
We observe that the fraction of the path that lies
within the first ISP tends to be significantly smaller than
that within the second ISP. For instance, the median is
0.22 for the first ISP and 0.64 for the second ISP. This is
consistent with hot-potato routing behavior because the
first ISP tends to hand off traffic quickly to the second
ISP, which carries it for a much greater distance.


Figure 14 also plots the distributions of the path lengths
in the case where the first ISP is Sprintlink. We find
that the difference between the ISP1 and ISP2 curves
is even greater in this case. Again, this is consistent with
hot-potato routing behavior on the part of Sprintlink for
routes from academic locations.
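
The measurement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the path representation (a list of (ISP name, distance) segments in hop order), the function names and the idea of summing per-ISP distance are ours, not the authors' tooling.

```python
from statistics import median

def isp_fractions(path, major_isps):
    """Return (frac_first, frac_second) for one path, or None.

    `path` is a hypothetical list of (isp_name, distance_km) segments in
    hop order; only ISPs named in `major_isps` are considered.
    """
    total = sum(d for _, d in path)
    order = []    # major ISPs in order of first appearance
    within = {}   # distance carried inside each major ISP
    for isp, dist in path:
        if isp in major_isps:
            if isp not in within:
                order.append(isp)
                within[isp] = 0.0
            within[isp] += dist
    if len(order) < 2 or total == 0:
        return None   # path does not traverse two major ISPs
    return within[order[0]] / total, within[order[1]] / total

def median_fractions(paths, major_isps):
    """Median first-ISP and second-ISP fractions over all qualifying paths."""
    pairs = [f for f in (isp_fractions(p, major_isps) for p in paths) if f]
    return median(a for a, _ in pairs), median(b for _, b in pairs)
```

A first-ISP median well below the second-ISP median, as reported in the text, is the pattern one would expect under hot-potato routing.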


5.5 Summary 


In this section, we have used geographic information to 
study various aspects of wide-area Internet paths that 
traverse multiple ISPs. We found that end-to-end Internet
paths tend to be more circuitous than intra-ISP
paths, presumably because of the peering relationships 
between ISPs. Furthermore, paths that traverse substan- 
tial distances within two or more ISPs tend to be more 
circuitous than paths that largely traverse only a sin- 
gle ISP. Some of this circuitous routing behavior can 
be attributed to sub-optimal geographic peering between 
ISPs. Finally, the findings of our geography-based analysis
are consistent with the hypothesis that ISPs generally
employ hot-potato routing. The presence of hot-potato
routing may also explain why some major ISPs only
account for a relatively small fraction of the end-to-end 
path. 


6 Geographic fault tolerance of ISPs 


An important component of studying Internet routing 
is to understand its fault tolerance aspects. Fault toler- 
ance of a network is normally studied at the granularity 
of router or link failures. However such a failure model 
does not capture the fact that two seemingly independent 
routers can be susceptible to correlated failures. 


We ask the question: what is the tolerance of an ISP’s 
network to a total network failure in a geographic region,
i.e., a failure that affects all paths traversing the
region? We refer to such a failure as a geographic failure.
Potential causes of such a failure include natural
calamities such as earthquakes, as well as power blackouts.


By using the geographic location information of the 
routers, we can identify routers that are co-located and 
thereby construct a geographic topology of an ISP. In 
this topology, each geographic region is associated with
a node and an edge between two nodes signifies the existence
of at least one long-haul backbone link that connects
the corresponding geographic regions.


We obtained the geographic topologies for 9 of the 13 
major ISPs listed in Section 3.4.1 from the CAIDA MapNet
site [24]. These are: AT&T, Cable and Wireless,
Sprintlink, Genuity, Qwest, PSINet, UUNet, Verio and 
Exodus. Many of these topologies are obtained from 
information published at the ISPs’ Web sites and are 
between 6-12 months out of date. Although it may be 
possible to construct an ISP’s geographic topology using
extensive traceroute measurements, it would be hard
to assess the completeness of the constructed topology.
Hence we restrict ourselves to the geographic topologies
obtained from CAIDA. However, as acknowledged by 
CAIDA [24], it is possible that these topologies may 
themselves be incomplete. This may be due to lim- 
ited tracing or the presence of backup paths in routing. 
We will perform our analysis under the assumption that 
these topologies are reasonably complete and only have 
a few missing links. 


6.1 Degree distributions 


The degree of a node provides a first-level quantification 
of the fault tolerance of that node in a given topology. 
A node with a degree k can tolerate up to k geographic 
failures before getting completely disconnected from all 
other nodes in the topology. In particular, a leaf node is 
not resilient to the geographic failure of its neighbor, but 
the failure of a leaf node itself has minimal impact on 
the rest of the network. On the other hand, the failure of 
a node with a very high degree would impact its many 
neighbors (corresponding to many different geographic 
regions).


Given complete freedom in placing E = k * N edges
on N nodes, it is possible to construct a topology that
has a minimum vertex-cut of 2k. In other words, the E
edges can be placed in such a way that even in the presence
of any 2k - 1 node failures in the graph, the resulting
topology will still remain connected. We term such a
placement of edges that maximizes the size of the vertex
cut as an optimal placement. In the optimal placement,
all the vertices have the same degree, viz. 2 * k. For the
simple case of k = 1, the optimal placement results in a
ring topology. Although this optimal placement may be 
difficult to construct due to practical constraints, it pro- 
vides us a nice reference point for comparing the fault 
tolerance of ISP topologies. In order to contrast an ISP’s 
topology from the optimal scenario, we look at the de- 
gree distribution of the nodes. We say that a graph has 
a skewed degree distribution if its node degrees are distributed
over a wide range, with a few large node degrees
and a high percentage of leaf nodes. The Internet
topology exhibits a skewed degree distribution which
can be characterized by a power law as described in [4]. 
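
The text does not say which construction it has in mind for the optimal placement. One standard candidate, assumed here purely for illustration, is a circulant (Harary-style) graph in which every node on a ring is linked to its k nearest neighbours on each side; for N > 2k this uses k * N edges and gives every node degree 2k. The sketch below builds such a graph and checks by brute force, for small N only, that it stays connected under any given number of node failures. All function names are hypothetical.

```python
from itertools import combinations

def circulant(n, k):
    """Adjacency sets for n ring nodes, each linked to k neighbours per side."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for step in range(1, k + 1):
            w = (v + step) % n
            adj[v].add(w)
            adj[w].add(v)
    return adj

def connected(adj, removed):
    """True if the nodes outside `removed` form a single connected component."""
    alive = [v for v in adj if v not in removed]
    if not alive:
        return True
    seen, stack = {alive[0]}, [alive[0]]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w not in removed and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(alive)

def survives_all_failures(adj, f):
    """Exhaustively test every set of f failed nodes (small graphs only)."""
    return all(connected(adj, set(c)) for c in combinations(adj, f))
```

With n = 8 and k = 2, the graph survives every set of 2k - 1 = 3 failures but not every set of 4, since removing all four neighbours of a node isolates it; with k = 1 the construction reduces to the ring mentioned in the text.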


Figure 15: Degree Distribution of Geographic
Topologies of ISPs


Among the 9 commercial ISPs, some, such as
AT&T and Genuity, have very skewed degree distributions,
while others, such as PSINet and Verio, have
much less skewed degree distributions (closer to optimal).
The degree distribution will not be affected much
by a few missing links. Figure 15 shows the degree
distributions of AT&T and PSINet. AT&T’s topology
has the maximum percentage of leaves among the 9
ISP topologies (62%) and has a few nodes with a degree
greater than 12 (Chicago, Dallas). On the other hand, 
more than 50% of PSINet’s nodes have a degree of ei- 
ther 2 or 3. This matches the optimal degree for Verio 
given that it has an edge to node ratio k = 1.5, which
corresponds to an optimal degree of 2 * k = 3. The
ISP-Combine curve shows the degree distribution of the 
geographic topology obtained by combining the topology
graphs of all 9 ISPs. The geographic nodes corresponding
to the same city in the individual ISP topologies
map to a single node in the combined topology. The
combined topology still has a significant skew in its de- 
gree distribution. 29% of the nodes continue to be leaves. 
This happens despite the combined topology having an 
edge to node ratio of k = 2.5, which corresponds to an
optimal degree of 5. On the other hand, nodes located
in the important networking hubs of the U.S. (e.g., San Jose,
Washington DC, Chicago) have a degree of more than 
20 in the combined topology. 


6.2 Failure of high connectivity nodes 


The skewed degree distributions of many tier-1 ISPs
indicate that many geographic regions of an ISP may 
get disconnected if some high connectivity geographic 
nodes fail. To evaluate this, we consider the failure sce- 
nario where the f nodes of highest degrees in a graph 
fail. 


We define a pair of geographic nodes that are connected
by a network path and can communicate with each other
as a communicating pair. A connected topology of N
nodes can support N(N + 1)/2 communicating pairs.
(Since each node represents a geographic region, we also
consider intra-node communication of a node with itself.)
Under the scenario where the f nodes of highest
degrees fail, the graph is disconnected into a forest
where a node can only communicate with other nodes in
its connected component. A connected component with
m < N nodes can support m * (m + 1)/2 communicating
pairs. In the simple case where the parent of a leaf
node fails, it produces a connected component of size 1
which supports exactly one communicating pair.
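
The metric defined above can be computed directly from an adjacency list. The following is a minimal sketch, not the authors' code: the adjacency-dictionary input, the function names and the choice to normalize by the pair count of the intact topology are assumptions, and ties between nodes of equal degree are broken arbitrarily by sort order.

```python
def communicating_pairs(adj, failed=frozenset()):
    """Count communicating pairs: m * (m + 1) / 2 per surviving component."""
    seen, total = set(failed), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:                      # depth-first walk of one component
            v = stack.pop()
            size += 1
            for w in adj[v]:
                if w not in seen:         # failed nodes are already in `seen`
                    seen.add(w)
                    stack.append(w)
        total += size * (size + 1) // 2
    return total

def surviving_fraction(adj, f):
    """Fraction of pairs left after the f highest-degree nodes fail."""
    by_degree = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    failed = set(by_degree[:f])
    return communicating_pairs(adj, failed) / communicating_pairs(adj)
```

For a star with one hub and four leaves, the intact topology supports 5 * 6 / 2 = 15 pairs; once the hub fails, each leaf is a component of size 1, leaving 4 pairs.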


Figure 16: Tolerance to Geographic Failures


Figure 16 shows the percentage of communicating pairs 
supported in the various ISP networks in the face of a varying
number of geographic failures. The combined topology
of the 9 ISPs supports 68% of the communicating
pairs even after the removal of 5 important networking
hubs in the US (San Jose, New York, Washington DC,
Chicago, Los Angeles). Among the 9 ISPs, Genuity
and PSINet exhibit the worst and the best fault tolerance
characteristics, respectively. In the face of a single node failure,
most of the ISPs lose between 15% and 30% of their 
communicating pairs in the worst case. 


It is important to note that these results may represent 
a near-worst case failure scenario for the ISPs. If
many backup links are missing from our topology,
the fraction of communicating pairs may be much higher
than what we have portrayed. Nevertheless, our essential
message from this analysis is that a balanced degree dis- 
tribution is a good feature for building a fault tolerant 
topology for an ISP. 


7 Conclusions 


In this paper, we have presented geography as a means 
for analyzing various aspects of Internet routing. First, 
our analysis based on extensive traceroute data shows 
the existence of many circuitous routes in the Internet. 
From the end-to-end perspective, we observe that the 
circuitousness of routes depends on the geographic and 
network locations of the end-hosts. We also find that the 
minimum delay along a path is more strongly correlated 
with the linearized distance of the path than it is with the
geographic distance between the end-points. This suggests
that the circuitousness of a path does impact its
minimum delay characteristics, which is an important
end-to-end performance metric. In ongoing work, we are
studying the correlation between geography and network
performance. 


Second, a more careful examination shows that many 
circuitous paths tend to traverse multiple major ISPs. 
Although many of these major ISPs have points of pres- 
ence in common locations, the peering between them is 
restricted to specific geographic locations, which causes 
the paths traversing multiple ISPs to be more circuitous. 
We also found that intra-ISP paths are far less circuitous 
than inter-ISP paths. An important requirement to reduce 
the circuitousness of paths is for ISPs to have peering re- 
lationships at many geographic locations. 


Third, the fraction of the end-to-end path that lies within 
an ISP’s network varies widely from one ISP to another. 
Furthermore, when we consider paths that traverse two 
or more major ISPs, we find that the path generally tra- 
verses a significantly shorter distance in the first ISP’s 
network than in the second. This finding is consistent
with the hot-potato routing policy. Using geographic in- 
formation, we are able to quantify the degree to which 
an ISP’s routing policy resembles hot-potato routing. 


Finally, our analysis of geographic fault tolerance of 
ISPs indicates that the (IP-level) network topologies of 
many tier-1 ISPs exhibit skewed degree distributions
which may induce a low tolerance to the failure of a sin- 
gle, critical geographic node. The combined topology of 
multiple ISPs exhibits better fault tolerance characteris- 
tics, assuming that the ISPs peer at all geographic loca- 
tions that are in common. 


Abstract


It is desirable to hold network attackers ac- 
countable for their actions in both criminal 
investigations and information warfare situa- 
tions. Currently, attackers are able to hide 
their location effectively by creating a chain 
of connections through a series of hosts. This 
method is effective because current host audit 
systems do not maintain enough information 
to allow association of incoming and outgo- 
ing network connections. In this paper, we 
introduce an inexpensive method that allows 
both on-line and forensic matching of incoming
and outgoing network traffic. Our method
associates origin information with each pro- 
cess in the system process table, and en- 
hances the audit information by logging the 
origin and destination of network sockets. We 
present implementation results and show that 
our method can effectively record origin in- 
formation about the common cases of step- 
ping stone connections and denial of service 
zombies, and describe the limitations of our 
approach. 


1 Introduction 


As the Internet has become a widely ac- 
cepted part of the communications infrastruc- 
ture there has been an increase in the num- 
ber of network attacks [18]. One factor in 
the growth of attacks is that network attack- 
ers are only rarely caught and held account- 
able for their actions, giving them relative im- 
punity in action. This situation has arisen, 
in part, because of the relative ease that at- 
tackers have in hiding their location, making 
it difficult and expensive for investigators to 
determine the origin of an attack. 


In general, attackers use two different
methods to hide their location [16]. One
method, common in denial-of-service attacks, 
is to spoof the source address in IP packet 
headers so that recipients cannot easily de- 
termine the true source. As discussed further 
below, this has been an area of significant re- 
search in recent years. The other method, 
which has received significantly less attention 
from the research community, is for attackers 
to sequentially log into a number of (typically 
compromised) hosts. These forwarding hosts, 
often called stepping-stone hosts [40], effec- 
tively disguise the origin of the connection, 
as each host on the path sees only the previ- 
ous host on the connection chain. A victim of 
an attack would not be able to determine the 
source of an attack without tracing the path 
back through all intermediate stepping-stone 
hosts. The audit data currently maintained 
at hosts is generally insufficient to correlate 
incoming and outgoing network traffic, so re- 
search about this problem has concentrated 
only on what can be deduced from network- 
level data. However, streams can be modified 
or delayed in a host so that a correlation is 
no longer possible from a network-level point 
of view, necessitating a host-based solution. 


In this paper we will discuss a simple and 
inexpensive method for maintaining the nec- 
essary information to correlate data entering 
a host with data leaving a host. The goal of 
this work is to provide additional audit data 
that can help determine the source of net- 
work attacks. We include results from an im- 
plementation for the FreeBSD 4.1 kernel that 
show the technique is effective in providing 
information useful in tracing common attack 
situations, particularly for tracing stepping 
stones and denial-of-service attack zombies. 


The next section provides a complete background
of related work in the area and provides
our view of the problem and design
criteria to be addressed in providing a solu- 
tion. Section 3 describes the technique we 
use to obtain and maintain location informa- 
tion for each process, and the logging mech- 
anisms that can be used to provide forensic 
access to the data. Section 4 describes the 
specific application of the technique to the 
FreeBSD 4.1 kernel, and is followed in Sec- 
tion 5 by examples of the implementation in 
action. Section 6 outlines the limitations of 
our approach, and how these limitations can 
be addressed in future work. Finally, Sec- 
tion 7 provides a summary of our work. 


2 Background 


The goal of network traceback research is to 
allow determination of the source of attack 
traffic, so that a particular host used by a 
human to initiate an attack can be identified, 
and real-world investigative techniques used 
to locate the person responsible. 


In order to accomplish this, the two prob- 
lems described above — locating the source 
of IP packets and determining the first node 
of a connection chain — need to be solved. As 
described below, there has been significant re- 
search in locating the source of IP packets, 
and there have been efforts made to identify
connection chain sources by examining
network traffic. What is lacking is a reliable 
method of correlating incoming network traf- 
fic to a host with outgoing network traffic em- 
anating from the host. This paper presents a 
mechanism for doing this. While our method 
is not always reliable, as discussed in Sec- 
tion 6, we believe that with further research 
and community involvement, this work can 
help address what is a serious problem. 


2.1 Packet Source Determination 


In normal operation, a host receiving packets
can determine their source by direct examination
of the source address field in the IP
packet header. Unfortunately, this address is 
easy to falsify, making it simple for attack- 
ers to send packets that have their source ef- 
fectively hidden. This is more common for 
one-way communication, such as the UDP 
and ICMP packets used in denial-of-service 
attacks, but has been of use in attacks using
TCP streams [21, 2]. There has been
significant recent research in how to locate 
the source of such packets, primarily moti- 
vated by distributed denial-of-service (DDoS) 
attacks in early February of 2000. While 
it is generally recommended that routers be
configured to perform ingress or egress filtering
[11], it is clear from continuing denial-of-service
attacks [20] that this is not widely
done. There have been other methods pro- 
posed to perform filtering to limit the effect 
of such attacks [24, 14]. 


As it is currently not possible to prevent 
such attacks, recent work has focused on how 
to locate the source of attacks. Some meth- 
ods add or collect information at routers to 
allow traceback of DoS traffic [6, 27, 35, 7, 30]. 
Other methods add markings to the pack- 
ets to probabilistically allow determination of 
the source given sufficient packets [28, 31, 23, 
8, 9], or forward copies of packets, encapsu- 
lated in ICMP control messages, directly to 
the destination [1, 37]. A more innovative 
method uses counter-DoS attacks to locate 
the source of on-going attacks [4]. While we 
do not require that these schemes be avail- 
able, we can make effective use of the trace- 
back information they provide. 


2.2 Correlating Streams 


Research addressing determination of the 
source of a connection chain has mainly fo- 
cused on correlating streams of TCP connec- 
tions observed at different points in the net- 
work. Figure 1 shows an example of a con- 
nection chain. 


The initial work in matching streams con- 
structed thumbprints of each stream based on 
content [32]. While this technique could ef- 
fectively match streams, it would be ineffec- 
tive in compressed or encrypted streams as 
are common today. Other work compared 
the rate of sequence number increase in TCP 
streams as a matching mechanism, which can 
work as long as the data is not compressed at 
different hops and does not see excessive network
delay [38]. Another technique, which
relies solely on the timing of packets in a 
stream, is effective against encrypted or com- 
pressed streams of interactive user data [40]. 
This work was originally intended for intrusion
detection purposes but was also proposed
as an effective method for finding the
source of connection chains. While performing
stream matching might be effective in
some cases, such methods rely on examining
network information, and might be vulnerable
to the same methods that can be
used to defeat network intrusion detection
systems [26].


Figure 1: A sample connection chain


2.3 Forensic System Analysis 


One of the objectives of computer foren- 
sics is the reconstruction of events that oc- 
curred on a system. Tools like the Coroner’s 
Toolkit [10] attempt to discover hidden and 
deleted files and use access times to deduce 
system activities. A more formal model for 
file and event reconstruction is given by An- 
drew Gross [13]. 


In order to solve the host causality prob- 
lem using forensic tools, it is necessary that 
network traffic is logged in some fashion on 
the host. Usually the essential information 
needed to associate incoming and outgoing 
traffic is not provided as the default on a 
system. While tools like TCP wrappers do a
fine job logging incoming network traffic for 
the essential services, usually outgoing traf- 
fic is not logged at all. Furthermore, at best 
network activity can be tied to a particular 
user on the system. Exactly which processes 
and programs are involved may be obscured 
if there is a large amount of activity by that 
user. 


2.4 Host causality 


Though certain aspects of the network 
traceback problem have been addressed by 
the approaches described above, a new area 
of research that is concerned with data trans- 
formations or data flow tracking through a 
host is needed for a complete picture for at- 
tack origin traceback. We call this new area 
host causality, because we are attempting to
determine what network input causes other
network output.


Common operating systems do not cur- 
rently provide information that can match in- 
coming and outgoing network traffic. While 
there has been some work that attempts to 
use existing system information to match ac- 
tive incoming and outgoing streams [15, 5], 
this work has either been shown to be impractical
to securely implement [3], or requires
an external trigger to store forensic informa- 
tion. Ideally, it should be possible to deter- 
mine whether network traffic was originated 
directly from a particular host, or occurred as 
a result of a connection from some other re- 
mote machine, and, if possible, which remote 
machine is involved. This would not only help 
in tracing back to the source of a network attack,
but could be useful in showing due diligence,
so that the owner of a machine used
in an attack could demonstrate that the attack
originated elsewhere. 


A solution that addresses the problem of 
tracing connections through a host is neces- 
sary because a host on the network can trans- 
form data passing through it in such a way 
that, from the network’s point of view, it can 
no longer be easily related to traffic leaving
it. This might be the case in a stepping stone 
scenario, if the traffic is delayed, or differently 
compressed or encrypted. Also, in attacks 
like a distributed denial-of-service (DDoS) at- 
tack [39], control traffic cannot be linked to 
the resulting attack traffic. In such an attack, 
packet source-location techniques might iden- 
tify the source of a particular attack stream, 
but will not allow identification of the master 
or the controlling host. This is due to the fact 
that the datagrams that are used to perform 
the attack are seemingly unrelated to those 
that control the client. What is missing is in- 
formation within the host that can be used 
to associate an incoming control packet with 
outgoing attack data. 


2.4.1 Desired Properties


The following properties either need to be ful- 
filled or seem desirable in order to achieve a 
practical solution to the host causality prob- 
lem: 


1. It must be possible to determine whether 
a given process on the host was started 
by a local user or remotely. 


2. If a process was started by a user at a 
remote location, information about that 
source must be maintained and associ- 
ated with the process. 


3. An audit facility must exist that allows 
the logging of incoming network traffic 
and processes that receive it. This will 
allow correlation between the source of a 
process and the source of incoming net- 
work packets. 


4. An audit facility must exist that allows 
the logging of individual outgoing net- 
work traffic and processes that send it. 
Combined with the facility above, one 
could then relate incoming and outgoing 
traffic processed by the same process. 


5. The logs maintained about origin infor- 
mation should be resistant to modifica- 
tions by attackers. 


6. Processes that spawn other processes 
need to pass on their source information 
to their children, or, if they provide a re- 
mote login service, pass on the remote 
location as the child’s new source. 


7. The modifications to a system should be 
minimal so that they do not interfere 
with existing software. 


8. Due to restricted logging space, it should 
be possible to use rules to control what 
data the audit system collects. 


9. It should be possible to quickly identify 
processes that were not started locally 
together with their remote location. 


3 Description of Model 


A process on a computing host is an exe- 
cuting instance of a program [34]. Processes 
are therefore, among many other things, re- 
sponsible for receiving and generating net- 
work data on a host that is connected to a 
network. 


Processes can be started:

• explicitly by a human being

• by the system


A human being can start processes:

• while physically present at the host

• from a remote location

• indirectly through some other process he
  or she started


The system can start processes:

• through startup scripts (including init)

• through scheduling services like cron
  and at

• through system services like inetd


The origin of a process is the information 
about how any process running on the system 
was started in regard to the above possibili- 
ties. For the purpose of this paper only a dis- 
tinction between a process that was started 
by a human being from a remote location (re- 
mote origin) and the other ways (local origin) 
is of importance, with the exception of the 
special case of indirectly started processes. 


In case of a remote origin for a process, 
the origin information should include that re- 
mote location. If the system tracks the origin 
of a process and a process sends out network 
traffic and is of remote origin, then the sys- 
tem can make a connection between the traf- 
fic that was sent out and the traffic that was 
received from the origin of the process over 
the network. The traffic could be individual 
datagrams, or they can be part of an estab- 
lished connection. 


In order to gain access to a system from 
a remote location and start new processes 
there, a user has to make use of a service of- 
fered on that particular system. Usually most 
systems provide well-known services such as 
telnet, rsh, or ssh that will give a remote 
user a shell on the system. However, there 
are other possibilities to create new processes 
that do not involve an interactive shell. In 
fact, any process listening on an open port 
on the system may be used or misused for 
such purposes. Our solution does not address 
these problems, and they are a topic for fu- 
ture investigation. 


As the only legitimate remote access to
a system is through its well-known services, 
it is feasible to store information about the 
existing connection with the newly created 
child process. After a successful login pro- 
cedure, the source of the new process should 
reflect the information stored about the con- 
nection. Note that the origin of a process 
and its subsequent children is set at the time 
a user gains access to the system. All pro- 
grams that will be started during that remote 
session will inherit that origin. At this point 
time delays become irrelevant, as origin infor- 
mation is stored with the processes no matter 
whether or not processes become dormant for 
any amount of time. 
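
The inheritance rule described above can be illustrated with a toy user-space model. This is a sketch only, not the kernel implementation: the class and method names, and the reduction of origin information to an (address, port) pair, are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Origin = Tuple[str, int]  # hypothetical: (remote IP address, remote port)

@dataclass
class Proc:
    pid: int
    parent: Optional[int]
    origin: Optional[Origin] = None  # None stands for local origin

class ProcessTable:
    """Toy process table that carries origin information per process."""

    def __init__(self) -> None:
        self.procs: Dict[int, Proc] = {1: Proc(1, None)}  # init, local origin
        self._next_pid = 2

    def fork(self, parent_pid: int) -> int:
        """A child inherits its parent's origin unchanged."""
        parent = self.procs[parent_pid]
        pid = self._next_pid
        self._next_pid += 1
        self.procs[pid] = Proc(pid, parent_pid, parent.origin)
        return pid

    def login(self, pid: int, peer: Origin) -> None:
        """After a successful remote login, record the connection's source."""
        self.procs[pid].origin = peer
```

Because the origin is copied at fork time and stored with the process, it remains available however long a descendant stays dormant, which is the property the paragraph above relies on.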


3.1 Information Storage 


From the viewpoint of a host, all that can 
be deduced about the origin of an arriving 
network packet is the interface that it arrived 
on and the information that is contained in 
the packet itself. A host on its own cannot 
determine whether a network datagram was 
spoofed or not. Therefore, for IP packets,
the five-tuple consisting of source and desti- 
nation IP addresses and source and destina- 
tion ports and the protocol number must suf- 
fice to distinguish source information main- 
tained about processes on a host. If packet 
traceback schemes are deployed and can pro- 
vide additional information, it is possible to 
maintain that information as well. 


While storing information about active 
processes can be useful, for complete analysis 
of attacks, some additional information needs 
to be logged as well. The logging mechanism 
can maintain more explicit information than
simply storing the IP five-tuples. Along with 
the five-tuple and timestamp, the system can 
also store the interface on which the packet 
arrives and the process id. If the system logs 
individual packets, it can also store a check- 
sum of the non-changing parts of each packet 
header that is logged in case the need for a 
more detailed post-analysis matching arises. 
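
One possible shape for such a log record is sketched below. The field names, the use of SHA-256 as the "checksum", and the idea of passing in the invariant header bytes pre-extracted are all assumptions for illustration; the text does not specify a record layout or a checksum algorithm.

```python
import hashlib
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class FiveTuple:
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int  # IP protocol number, e.g. 6 = TCP, 17 = UDP

@dataclass(frozen=True)
class LogRecord:
    flow: FiveTuple
    timestamp: float
    interface: str        # interface on which the packet arrived
    pid: int              # process that received or sent the packet
    header_checksum: str  # digest of the non-changing header fields

def make_record(flow, interface, pid, invariant_header, now=None):
    """Build one audit record.

    `invariant_header` holds the header bytes that do not change in
    transit; which fields those are is left open in this sketch.
    """
    digest = hashlib.sha256(invariant_header).hexdigest()
    ts = time.time() if now is None else now
    return LogRecord(flow, ts, interface, pid, digest)
```

Keeping the record immutable (frozen dataclasses) is a design choice of the sketch; it mirrors the desired property that origin logs resist later modification, although real tamper resistance would need more than this.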


Furthermore, it would be expensive and
impractical to log every packet that
makes up a TCP stream.
Since TCP is a connection-oriented transport
layer protocol, it is sufficient to only
regard incoming and outgoing SYN requests
for the purposes of logging. Unfortunately,
UDP is a connection-less protocol. Thus for 
UDP, all packets need to be logged. Log- 
reducing mechanisms that group the same 
kind of UDP packets together can certainly 
be applied here, but this is out of the scope 
of this paper. 
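
The logging policy just described amounts to a small per-packet decision. The sketch below is one reading of it, assuming that a "SYN request" means a TCP segment with the SYN flag set and the ACK flag clear; the function name is hypothetical, and protocols other than TCP and UDP are simply not logged here.

```python
TCP, UDP = 6, 17        # IP protocol numbers
SYN, ACK = 0x02, 0x10   # TCP flag bits

def should_log(protocol, tcp_flags=0):
    """Decide whether a packet is logged under the policy sketched above."""
    if protocol == TCP:
        # A connection request carries SYN without ACK.
        return bool(tcp_flags & SYN) and not (tcp_flags & ACK)
    if protocol == UDP:
        return True    # connection-less: every datagram is logged
    return False       # other protocols are outside this sketch
```

Under this policy a long-lived TCP session contributes one record per direction of connection setup, whereas a UDP flow contributes one record per datagram unless some grouping mechanism is added.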


3.2 Limitations on Information Availability


For well-known services we can assume that 
there will only be one open network connec- 
tion for each child process spawned as they 
adhere to the common style of running Unix 
servers that fork for each new request [33]. 
Non-standard server programs might behave 
differently, however, and there might be mul- 
tiple open connections when we try to de- 
termine the origin information. In this case, 
it is impossible to be sure which connection
should be considered as the origin of the pro- 
cess. Because of this, there can be a problem 
with using the latest data from the accept 
system call as the origin information. If a 
server program allows multiple open sockets 
before the call to login, then there is a pos- 
sibility that the wrong origin information is 
stored with the process. It is possible to de- 
sign a program that after accepting a con- 
nection opens another listening socket to re- 
ceive a decoy connection from a completely 
different remote site or, even worse, from the 
local host itself. This would set the infor- 
mation obtained from the accept call to the 
new socket’s source data, before login was in- 
voked. After a successful login procedure, the 
origin information would be incorrect. If a 
local user installs such a program, then any 
attacks originating from it can be viewed as 
originating from the host, which is consistent 
with our definition of local origin. 


Another problem is that a remote user may 
still hide his real origin by creating a connec- 
tion from the system to itself. In this case the 
origin information of the process gets changed 
to the source information of the local host. 
While the process is still being considered of 
remote origin, it is of no value from a trace- 
back perspective. If many remote processes 
“change” their origin in such a fashion, one 
cannot determine anymore what the “real” 
origin of any of those was. In order to pre-
vent this obscuring of the origin of a process,
one needs to keep track of an inheritance line
for remote processes. That is, for any given 
process of remote origin, one must be able 
to determine its parent process if that parent 
process also was of remote origin. 
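
One way to picture such an inheritance line is as a chain of parent/child records that is followed upward from a given process. The sketch below is only an illustration: struct fork_record, logged_parent and inheritance_line are hypothetical names, records are assumed to exist only for parents of remote origin, and process id reuse is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* One record per fork by a process of remote origin (hypothetical). */
struct fork_record {
    int parent;
    int child;
};

/* Returns the parent recorded for pid, or -1 if none was recorded. */
static int logged_parent(const struct fork_record *recs, size_t n, int pid)
{
    for (size_t i = 0; i < n; i++)
        if (recs[i].child == pid)
            return recs[i].parent;
    return -1;
}

/* Writes the inheritance line of pid into line (pid first, oldest
 * recorded ancestor last) and returns its length, at most max. */
static size_t inheritance_line(const struct fork_record *recs, size_t n,
                               int pid, int *line, size_t max)
{
    size_t len = 0;

    assert(recs != NULL && line != NULL);
    while (pid != -1 && len < max) {
        line[len++] = pid;
        pid = logged_parent(recs, n, pid);
    }
    return len;
}
```

Because the walk stops at the first process without a recorded parent, the last element is the earliest ancestor known to be of remote origin.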


4 Implementation 


The model described above was imple- 
mented in the FreeBSD 4.1 operating system 
on an i386 based PC. While the implemen- 
tation is therefore specific to the UNIX op- 
erating system, the general principles of the 
model should be applicable to other systems 
as well. 


All processes that accept network connec- 
tions do need to make use of the socket sys- 
tem calls provided by the system. Stevens 
[33] describes the necessary steps to set up
a TCP or UDP server. They involve sys- 
tem calls to bind, listen, and accept, in 
that order. Thus any connection between two 
systems must have successfully undergone a 
call to accept on the server side. In the 
case of TCP, accept returns after a success- 
ful three-way handshake. In the case of UDP, 
accept returns upon reception of a packet 
that matches the socket characteristics. 


As a successful connection implies a suc- 
cessful return from the accept system call, 
it seems reasonable to make modifications 
there in order to obtain location information. 
Specifically, with the assumption of only one 
open network connection, it is sufficient to 
record the data from the last call to accept. 
This information will then be accessible to the 
child process created by the fork system call. 
Finally, after a successful login procedure, the 
source of the new process should reflect the 
information stored about the connection. As 
the login program lies in user space and not 
all well-known servers utilize it, it will be nec- 
essary to perform this step through one of the 
system calls such as setlogin. 


All the necessary information described in 
Section 3 is available within data structures 
used by the accept system call. Once the 
connection has been established, the socket 
descriptor contains the source IP address and
the source port of the purported source of
the traffic. To determine the destination IP
address and port that was used to establish
the connection, the system also has to access
the protocol control block (PCB) that is as-
sociated with the socket and that is pointed
to from the socket data structure¹. The in-
formation can be obtained through simple
pointer lookups.


4.1 Where to store source information


We decided to maintain the information di- 
rectly in the process table itself, because it 
is simple to add another field that contains 
the necessary information, and creation and 
termination of processes is handled automatically.
The inheritance problem is taken care
of as well, as the fork system call causes cer- 
tain fields of the process table to be copied to 
the child. The only time we therefore need to 
access the field in the process table is when 
origin information changes. The disadvan- 
tage of this approach is that some auxiliary 
programs such as top and ps might have to 
be adjusted to accommodate the changes. 


It is possible to utilize existing logging fa-
cilities, such as syslog, to record the data, or
a logging program can develop its own format
and location to store the information [25].
Ideally, there would be some mechanism to
ensure the integrity of the logs. Write-once,
read-many media or a secure logging facility
could be used [29].


4.2 Data structures and kernel modifications


For the source information, a new data 
type, struct porigin, was created as shown 
in Figure 2. 


The type field denotes whether the source 
is local (0) or remote (1). If the type is 0, 
all other fields are undefined and can be ig- 
nored. The next five fields are the typical 
four-tuple for a TCP or UDP connection, con- 
sisting of source and destination IP addresses 
as well as source and destination ports, plus 
the protocol number. The last parameter is a
timestamp, which denotes the time the con-
nection was established in network time for-
mat [19]. Note that the network interface is
not included here but can be obtained with
the information stored if necessary.


struct porigin {
    char           type;
    struct in_addr source_ip;
    struct in_addr dst_ip;
    u_short        source_port;
    u_short        dst_port;
    u_short        proto;
    time_t         tstamp;
};


Figure 2: The process origin data structure


¹ See McKusick et al. [17] or Wright and Stevens
[36] for further details.


In order to keep track of the correspond- 
ing source information for each process, the 
process table data structure (struct proc) 
was modified in two locations. It is neces- 
sary to retain the actual source information 
as well as information about the last accepted 
connection of a process. The latter is needed 
because all common TCP/IP based network 
services that provide a remote login facility
first accept the connection and then fork off 
a child process where login is called. 


Hence, two fields, origin and lastaccept,
were added to the process table structure,
both of type struct porigin. The fields are 
located in the area that gets copied in the 
fork system call. 


The copying of the origin field provides 
a simple and elegant solution for the inheri- 
tance mechanism. All it takes is a few more 
bytes to be copied in the fork system call, as 
the process structure is copied anyway. Thus, 
a child process always inherits the source in- 
formation from its parent. 


This leaves the question of where the two 
fields, lastaccept and origin are to be set. 
As the name already suggests, lastaccept 
is set in the accept system call, after a 
successful accept of an incoming connection. 
The modified accept system call was imple- 
mented as shown in Figure 3, which shows 
how to retrieve information from the PCB. 


Note that accept1 is called from the ac-
tual accept system call. The connection will
be accepted in the procedure soaccept. If 
the call is successful, the type is set to 1, and 
the four-tuple is obtained from the PCB as- 
sociated with the socket via the pointer inp. 


Note that this will only work for a TCP con- 
nection, which is used by services which pro- 
vide a shell. For future work, other protocol 
types need to be considered. For instance, in 
the case of UDP, the recvfrom system call 
may be modified in a similar fashion. 


The origin field will be copied from a par-
ent process to its child. However, as discussed
above, each time a login is performed within 
a process, the source information of the last 
accepted socket should become the new ori- 
gin information for that process. Thus, at 
an invocation of login, the lastaccept field 
should be copied into the origin field. How- 
ever, as discussed above, login is only a pro- 
gram in user space that simply utilizes sev- 
eral system calls to perform the actual user 
login. One could supply a separate system
call to have the lastaccept field copied to
the origin field, but that would require
every program that supplies a login service
to be changed to use it. Therefore, one of
the system calls used by every login service,
setlogin, was modified so that the field is
copied after a successful call. 
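
The interplay between fork and setlogin described above can be sketched in user space with a mock process entry. This is an illustration only: struct mock_proc, mock_fork and mock_setlogin are hypothetical stand-ins, and struct porigin is simplified to plain integer fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Simplified stand-in for struct porigin from Figure 2. */
struct porigin {
    char     type;                    /* 0 = local, 1 = remote */
    uint32_t source_ip, dst_ip;
    uint16_t source_port, dst_port;
    uint16_t proto;
    time_t   tstamp;
};

/* Mock process entry; the real struct proc has many more fields. */
struct mock_proc {
    int pid;
    struct porigin origin;      /* where the process came from */
    struct porigin lastaccept;  /* last accepted connection */
};

/* fork: the child inherits both fields by plain structure copy. */
static void mock_fork(const struct mock_proc *parent,
                      struct mock_proc *child, int child_pid)
{
    assert(parent != NULL && child != NULL);
    *child = *parent;
    child->pid = child_pid;
}

/* setlogin: on success, the last accepted connection becomes the origin. */
static void mock_setlogin(struct mock_proc *p)
{
    assert(p != NULL);
    p->origin = p->lastaccept;
}
```

In this sketch a server that accepts a connection, forks, and then calls mock_setlogin in the child leaves the parent's origin untouched while the child takes on the origin of the accepted connection.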


To keep track of the inheritance line for a 
remote process, it is necessary to modify the 
fork system call, as well. It is sufficient to 
record the process IDs of the parent and child 
processes in case the parent is of remote ori- 
gin. From this information, it is possible to 
reconstruct the entire inheritance line for a re- 
mote process up to the first parent that was of 
remote origin. The syslog facility provides 
an easy way to log kernel messages, and was 
chosen to record the information for reasons
of simplicity. Figure 4 shows the modifications
made to fork. In future work, this
recording mechanism needs to be refined and 
optimized.


accept1(p, uap, compat)
{
    (void) soaccept(so, &sa);
    inp = sotoinpcb(so);  // pointer to protocol control block
    populate fields in p->lastaccept from information
        pointed to by inp;
    p->lastaccept.type = 1;
}


Figure 3: The modified accept system call (pseudo-code) 
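
The "populate fields" step of Figure 3 can be illustrated with mock structures. This is a sketch under stated assumptions, not the kernel code: struct mock_inpcb and struct mock_socket are simplified stand-ins whose member names follow FreeBSD's struct inpcb and struct socket, and fill_lastaccept is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Simplified stand-in for struct porigin from Figure 2. */
struct porigin {
    char     type;                    /* 0 = local, 1 = remote */
    uint32_t source_ip, dst_ip;
    uint16_t source_port, dst_port;
    uint16_t proto;
    time_t   tstamp;
};

/* Mock PCB: member names follow struct inpcb, types are simplified. */
struct mock_inpcb {
    uint32_t inp_faddr;  /* foreign (remote) address */
    uint16_t inp_fport;  /* foreign port */
    uint32_t inp_laddr;  /* local address */
    uint16_t inp_lport;  /* local port */
};

struct mock_socket {
    void *so_pcb;        /* points to the protocol control block */
};

#define mock_sotoinpcb(so) ((struct mock_inpcb *)(so)->so_pcb)

/* Copies the connection endpoints from the PCB into *la. */
static void fill_lastaccept(struct porigin *la, const struct mock_socket *so,
                            uint16_t proto, time_t now)
{
    const struct mock_inpcb *inp;

    assert(la != NULL && so != NULL && so->so_pcb != NULL);
    inp = mock_sotoinpcb(so);
    la->source_ip   = inp->inp_faddr;  /* remote end is the source */
    la->source_port = inp->inp_fport;
    la->dst_ip      = inp->inp_laddr;  /* local end is the destination */
    la->dst_port    = inp->inp_lport;
    la->proto       = proto;
    la->tstamp      = now;
    la->type        = 1;               /* mark as remote */
}
```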


4.3 System calls 


In order to access the source informa- 
tion for a given process, a new system call,
getorigin, was added to the system. It
takes as parameters a process identifier and 
a buffer, into which the source information is 
copied. 
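
The behavior of such a call can be mimicked over a mock process table. The signature below (a process id and a buffer, returning 0 on success and -1 if no such process exists) is an assumption made for illustration; the text does not give the prototype's exact signature or return values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct porigin from Figure 2. */
struct porigin {
    char     type;                    /* 0 = local, 1 = remote */
    uint32_t source_ip, dst_ip;
    uint16_t source_port, dst_port;
    uint16_t proto;
};

struct mock_proc {
    int pid;
    struct porigin origin;
};

/* Mock process table with one local and one remote process. */
static struct mock_proc proc_table[] = {
    { 1,   { 0, 0, 0, 0, 0, 0 } },
    { 285, { 1, 0x0a000001, 0xc0a80001, 1022, 22, 6 } },
};

/* Copies the origin of pid into *buf; 0 on success, -1 if not found. */
static int mock_getorigin(int pid, struct porigin *buf)
{
    assert(buf != NULL);
    for (size_t i = 0; i < sizeof proc_table / sizeof proc_table[0]; i++) {
        if (proc_table[i].pid == pid) {
            *buf = proc_table[i].origin;
            return 0;
        }
    }
    return -1;
}
```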


Note that there is no system call to set or 
reset the origin field. With the getorigin 
system call, it is now possible to design log- 
ging facilities and administrative programs 
within user space that make use of the source 
information of a process. For reasons of sim-
plicity, the call was implemented to be unre-
stricted.


Another system call, portpid, was added 
to give support for the logging facility de- 
scribed below. If one wants to associate in- 
coming TCP or UDP packets with the receiv- 
ing process, one needs to find the process id of 
the socket that will handle an IP packet. The 
same is true for sockets that are responsible 
for outgoing packets. Those sockets are iden- 
tified in the network layer by the four-tuple 
of source and destination addresses and ports, 
but, unfortunately, there is no mechanism in 
FreeBSD to obtain that information within 
user space. Thus the system call portpid
will take such a four-tuple as well as a proto- 
col identifier (TCP or UDP) and will return 
the process id of the process that belongs to 
the listening socket that will accept packets 
matching the four-tuple, or belonging to the 
socket that sent the packet. If there is no such 
socket, an error will be returned. A weakness 
of this design is that a process may exit and 
be removed from the process table before the 
portpid call occurs. More verbose logging 
could offset this problem.


The FreeBSD operating system uses proto-
col control blocks (PCBs) to demultiplex in-
coming IP packets. The PCBs are chained
together in a linked list and contain IP source 
and destination addresses and TCP or UDP 
ports or wildcard entries for incoming packets 
to match against. Each PCB also contains a 
pointer to the socket that is destined to re- 
ceive a packet, should it match the four-tuple 
specified in the PCB. From the socket, one 
can then look in the receive or write buffer to 
obtain the actual process id of the receiving 
or sending process, respectively. In order to 
determine which process will receive a packet 
or which process sent a packet, one needs to 
traverse the list of PCBs until the best match 
is found, and then obtain the process id of the 
socket associated with the PCB. 
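
The best-match traversal described above can be sketched for incoming packets as follows. This is an illustration, not the kernel's lookup routine: struct mock_pcb and mock_portpid are hypothetical, the process id is stored directly in the PCB rather than reached through the socket, and "best" is taken to mean the matching entry with the fewest wildcards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ANY 0  /* wildcard value for an address or a foreign port */

struct mock_pcb {
    uint32_t laddr, faddr;  /* local and foreign address (ANY = wildcard) */
    uint16_t lport, fport;  /* local and foreign port (fport ANY = wildcard) */
    int pid;                /* owning process, simplified */
    struct mock_pcb *next;
};

/* Returns the pid of the best-matching PCB for an incoming packet,
 * or -1 if no PCB matches. */
static int mock_portpid(const struct mock_pcb *head,
                        uint32_t src_ip, uint16_t src_port,
                        uint32_t dst_ip, uint16_t dst_port)
{
    int best_pid = -1;
    int best_wild = 4;  /* more than the maximum of 3 wildcards */

    assert(dst_port != ANY);  /* port 0 is not a valid lookup here */
    for (const struct mock_pcb *p = head; p != NULL; p = p->next) {
        int wild = 0;

        if (p->lport != dst_port)
            continue;
        if (p->laddr == ANY) wild++; else if (p->laddr != dst_ip) continue;
        if (p->faddr == ANY) wild++; else if (p->faddr != src_ip) continue;
        if (p->fport == ANY) wild++; else if (p->fport != src_port) continue;
        if (wild < best_wild) {
            best_wild = wild;
            best_pid = p->pid;
        }
    }
    return best_pid;
}
```

With a listening socket and an established connection on the same port, a packet of the established connection resolves to the connection's process, while a packet from a new peer falls back to the listener.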


4.4 Logging facility 


The logging facility that was implemented
is merely a proof of concept, and there are
many feasible ways to design and implement
one. Our implementation of the logging facil-
ity uses the libpcap library, which uses
the Berkeley Packet Filter (BPF). The BPF
will make a copy of each incoming and outgo- 
ing network packet that matches given filter 
criteria and supply that copy to the process 
utilizing the filter. 


This prototype logging facility can therefore
be considered a network sniffer, but
a more robust and efficient implementation 
would be one that is part of the kernel it- 
self. For each TCP SYN or UDP packet seen 
by the sniffer, the portpid system call is in- 
voked to obtain the process id of the process 
responsible for the packet. Once the process 
id is obtained, getorigin is called for that 
process id to determine whether the process 
is of remote origin or not. If it is of remote 
origin, then the packet as well as the origin
information is printed out. Figure 5(a) shows
the interaction of the different parts of the
system with the logging facility, and Figure 6
shows the important parts of the routine that
processes the packets passed on by the BPF.


int
fork(p, uap) {
    error = fork1(p, RFFDG | RFPROC, &p2);
    if (error == 0) {
        p->p_retval[0] = p2->p_pid;
        p->p_retval[1] = 0;
        if (p->origin.type)
            log(LOG_INFO,
                "remote process %d spawned child %d\n",
                p->p_pid, p2->p_pid);
    }
}


Figure 4: The modified fork system call


There is a problem with logging outgoing 
UDP packets. The portpid system call re-
lies on the socket that sent the packet to be 
still open so that it can find it in the PCB 
list. If an application opened a socket, wrote 
one UDP packet, and immediately closed the 
socket again, there is a chance that the socket 
no longer exists when the packet is examined 
by the logging facility. DDoS clients usually 
keep the socket they send packets from open 
so that packets can be sent at a faster rate, 
but for outgoing control packets, this is a 
problem. For TCP, this is not a serious is- 
sue, as there is either a three-way handshake 
or a time-wait period at the end of each con- 
nection. 


One method to solve the outgoing UDP 
packet problem could entail further modifica- 
tion of the kernel, keeping the process ids of
sending processes in a cache and making that 
information available to the portpid system 
call. A similar approach could also improve 
lookup performance for incoming packets. In- 
stead of duplicating the de-multiplexing effort 
made in the networking stack, modifications 
to the stack could result in a new data struc- 
ture that returns the correct process id for a 
given five-tuple.


5 Implementation Results 


The modified kernel was installed on an In-
tel Pentium III 866 MHz Celeron PC. The
machine used is part of a small networking 


lab. We will discuss the effects of the changes 
on the normal system behavior as well as give
two examples of processes of remote origin 
handling traffic. 


5.1 Effects on normal system behavior


As the changes to the system were
few and cheap, the impact on the system is
minimal. The getorigin call copies a few bytes
from the process table and is only executed 
for TCP SYN and UDP packets. For those 
packets, the call to portpid causes a linked- 
list traversal of the protocol control blocks in 
the same manner the networking stack does 
its de-multiplexing. In every call to accept, 
the lastaccept field is set from the socket 
information. These operations are very few 
and inexpensive compared to the entire set 
of operations within accept. In every call to 
fork, an extra few bytes need to be copied 
to pass on the origin information to a pro- 
cess’s child. The way the syslog facility was 
used to keep record of an inheritance line is 
very inefficient. On a system where processes 
spawn many children, the logs may quickly 
wrap around. That and the fact that the in- 
heritance line needs to be reconstructed man- 
ually from the logs suggests the need for a re- 
design of the inheritance line for future work. 


5.2 Examples 


5.2.1 Stepping Stone 


In this example, bliss was used as a stepping 
stone. A user from evil (10.0.0.1) logged 
into bliss (192.168.0.1) via ssh. From 
there, he used ssh again, to log into final 
(172.16.0.1). The actual host names and
IP addresses have been replaced by fictitious
ones. This setup is equal to the example given
in Figure 1 with the exception of the very last
host.


Figure 5: The logging facility and DDoS attack experimental setup


The logging facility recorded the following 
entry from this: 


192.168.0.1:1022->172.16.0.1:22 sent by pid 285 
Origin: 10.0.0.1:1022-192.168.0.1:22 


One can observe that the origin informa-
tion indicates the connection from evil, port
1022, to bliss, port 22 (ssh). The logging
mechanism didn’t log the connection from
evil to bliss, as sshd is a local process.
However, evil is clearly shown as the ori-
gin for the process that connected to final.
Therefore one can now associate the stream
from bliss to final with the one from evil to
bliss for traceback purposes. 
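
For illustration, an entry in the format shown above could be produced as follows. This is a sketch, not the prototype's logging code: struct endpoint, the OCTETS macro and append_entry are hypothetical, and addresses are assumed to be IPv4 values in host byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct endpoint {
    uint32_t ip;    /* IPv4 address in host byte order */
    uint16_t port;
};

/* Expands an address into four dotted-quad octets for printf. */
#define OCTETS(ip) \
    (unsigned)((ip) >> 24 & 0xff), (unsigned)((ip) >> 16 & 0xff), \
    (unsigned)((ip) >> 8 & 0xff), (unsigned)((ip) & 0xff)

/* Appends one two-line entry for an outgoing packet to the
 * NUL-terminated text in buf. Returns 0 on success, or -1 if the
 * entry did not fit (it is then truncated). */
static int append_entry(char *buf, size_t cap,
                        struct endpoint src, struct endpoint dst, int pid,
                        struct endpoint osrc, struct endpoint odst)
{
    size_t used;
    int n;

    assert(buf != NULL);
    used = strlen(buf);
    assert(used < cap);
    n = snprintf(buf + used, cap - used,
                 "%u.%u.%u.%u:%u->%u.%u.%u.%u:%u sent by pid %d\n"
                 "Origin: %u.%u.%u.%u:%u-%u.%u.%u.%u:%u\n",
                 OCTETS(src.ip), (unsigned)src.port,
                 OCTETS(dst.ip), (unsigned)dst.port, pid,
                 OCTETS(osrc.ip), (unsigned)osrc.port,
                 OCTETS(odst.ip), (unsigned)odst.port);
    return (n < 0 || (size_t)n >= cap - used) ? -1 : 0;
}
```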


5.2.2 DDoS Client 


In this example, a DDoS trinoo client, ob- 
tained from the Packet Storm archive [22], 
was installed on bliss from evil. The cor- 
responding master was installed on another 
machine, master (192.168.0.2). Bliss was 
then used via master to perform a denial of 
service attack against victim (172.16.0.2),
a third machine in the test network. Fig-
ure 5(b) shows the setup for the attack.
Again, host names and IP addresses have
been changed.


A sample of the logging output is presented 
in Figure 7. 


The first logged event is a UDP packet from 
bliss to master, notifying the trinoo mas- 
ter that a client is active. The next event is 
then a UDP packet from master to bliss, 
triggering the DoS attack. The rest of the 
log shows UDP packets sent from bliss to 
victim as part of the attack. 


All the traffic can be unambiguously associ- 
ated with the process 3760, the DDoS client. 
From the origin, one can see that the process 
was started from evil. In this example, it 
is clear that the attack was controlled from 
master. This might not always be possible, 
as multiple packets from different locations 
could be received by the process just before 
an attack. However, by examining the logs 
a good estimate might be derived. At the 
very least it will give a list of possible hosts 
from where the attack was launched. Net- 
work traceback mechanisms can now be used 
to determine the location from where the soft- 
ware was set up and master could now be 
investigated in the same manner as bliss to 
determine more information about the attack
and the location of the attacker.


if (protocol is TCP) {
    set pointer to TCP header within the packet;
    remember source and destination ports;
    if (this is the start of a new connection)
        set log flag;
}
else if (protocol is UDP) {
    set pointer to UDP header within the packet;
    remember source and destination ports;
    set log flag;
}
if (log flag is set) {
    if (packet is coming in)
        invoke portpid with parameters for incoming packets;
    else if (packet is going out)
        invoke portpid with parameters for outgoing packets;
    else
        set error;

    if (portpid returned successfully) {
        call getorigin with pid returned by portpid;
        if (origin is remote) {
            if (packet is coming in)
                print log for incoming packets;
            else if (packet is going out)
                print log for outgoing packets;
        }
    }
}


Figure 6: The packet processing routine of the logging facility (pseudo-code)


192.168.0.1:1117->192.168.0.2:31335 (17) sent by pid 3760
Origin: 10.0.0.1:32155-192.168.0.1:13419
192.168.0.2:39805->192.168.0.1:27444 (17) received by pid 3760
Origin: 10.0.0.1:32155-192.168.0.1:13419
192.168.0.1:1135->172.16.0.2:12865 (17) sent by pid 3760
Origin: 10.0.0.1:32155-192.168.0.1:13419
192.168.0.1:1135->172.16.0.2:59850 (17) sent by pid 3760
Origin: 10.0.0.1:32155-192.168.0.1:13419
192.168.0.1:1135->172.16.0.2:10435 (17) sent by pid 3760
Origin: 10.0.0.1:32155-192.168.0.1:13419
192.168.0.1:1135->172.16.0.2:4577 (17) sent by pid 3760
Origin: 10.0.0.1:32155-192.168.0.1:13419


Figure 7: Output of the logging facility


6 Limitations and Future Work 


This paper presents a first attempt at a 
mechanism designed to address the problem 
of determining host causality. While it is
a step forward, it is not a
complete solution to the problem, though its
use could prove beneficial in many cases. We
hope that discussion of the limitations will 
foster other research on the problem. 


While available origin information is main-
tained for processes that utilize setlogin,
there are other mechanisms that attackers
can use to start processes on a system. Re-
motely, attackers might gain access to a sys-
tem using processes that service network re-
quests, such as mail, web, or ftp servers. Ex-
ploits such as buffer overflows against these
processes can produce user shells for the at-
tacker, bypassing the setlogin system call. In these
cases, origin information will not be properly 
recorded. For these cases the question arises 
when exactly to set the origin information so 
that it is meaningful. Furthermore, an at- 
tacker who gains access to a system might 
use a cron or at job to create a process after 
the attacker has logged off; this would also 
result in processes that lack the correct ori- 
gin information. A solution to this problem 
might be to include origin information in the 
file system so that when the new process was 
started the appropriate location information 
was available. 


Sometimes login servers can open a second 
connection to the client for out-of-band data. 
Currently this scenario is not handled in the 
design. However, it seems that in the worst
case the wrong port is recorded for the origin
within the modified accept system call.


An attacker also might use a covert chan- 
nel between processes to obscure the proper 
location information. In this scenario, an at- 
tacker, who perhaps enters the system through
a mechanism that invokes setlogin and 
whose processes therefore have correct origin 
information, uses some form of IPC to cause 
a process that has other origin information 
to send data into the network. This is a dif- 
ficult problem to deal with, as it has always 
been [12], and we do not have an immediate 
solution for it. Any process that listens on a 
covert channel needs to have been started ei- 
ther locally or remotely, however, and in the 
case of an external attacker, most likely re- 
motely. Thus, any outgoing traffic from that 
process will still be logged. 


While our implementation only operates on 
TCP and UDP packets, any protocol could 
be used by an attacker. For example, some 
DDoS tools use ICMP messages to send con- 
trol messages over the network. In this case, 
an attacker would either have to modify the 
routines for ICMP processing in the kernel or
may have to sniff the incoming traffic using a
library like libpcap. If the attacker has mod-
ified the kernel to listen to and process these
messages, there seems to be little that can be
done to establish the origin information for a
process, because if the kernel can be mod-
ified by the attacker, the origin information
can be tampered with as well. In the latter
case one can check for open BPF filters and 
also be aware of processes that utilize other 
protocols or do not receive network packets 
from the networking stack but rather through 
the packet filter. 


The mechanism for keeping track of the in- 
heritance line for a process needs to be im- 
proved. The current mechanism, while very 
simple to implement, is the only part of the 
modifications we made that affects the sys- 
tem in a noticeable fashion. One problem is 
that with each new child process, more in- 
formation needs to be stored, even though 
it is small. Once a separate data structure 
for keeping inheritance lines is used, a simple 
improvement would be to delete inheritance
lines or parts of them where all the processes
involved have terminated. However, overall
management of the inheritance lines remains
as future work.


In the event of a system compromise in
which an attacker gains root capabilities, the
origin information in the kernel and recorded
information in the file system is just as vul-
nerable to modification or deletion as any
other kernel or file system information. We
consider this outside of the scope of our
work, but point to other work that attempts
to make audit information survive such at-
tacks [29], and suggest that current forensics
tools could be modified to recover the altered
origin information in some cases.


Finally, as mentioned above, the packet
logging system is a prototype only; a more
effective design would be to include the log-
ging mechanism in the kernel itself. Instead
of sniffing for outgoing packets, writes to
a network socket would cause the outgoing
packet to be logged before the socket could
be closed, alleviating the problem with trying
to find the source of UDP packets mentioned
above. Additionally, the current mechanism
logs all TCP SYN and UDP packets, creating
a denial-of-service opportunity for attackers
to fill up disk space, so a more selective ap-
proach to recording packets is clearly in order,
where possible.


6.1 Future work in Host Causality 


Even though the origin information was de-
signed with network traceback in mind, there
are other applications or foundations for new 
modifications of the system: 


• A system administrator can use the ori-
gin information to determine the ori-
gins of all running processes and iden-
tify ones that have a very unusual source.
This can lead to the discovery of running
DDoS clients on a machine, for example.


• The origin information can be incorpo-
rated into the file system by storing a
process’s origin information with a file
whenever the process writes to the file
system. Not only can this help in solv-
ing the problem with cron and startup
scripts, but it can also aid in locating
suspicious programs in users’ home di-
rectories. This would be especially ef-
fective with logging file systems, so that
the changes in files could be tracked by
location as well.


• Origin information adds another dimen-
sion to access control. Access control
mechanisms can be altered so that they
take origin information into account and
grant certain privileges only when cer-
tain origin conditions are met.


• Statistics based on origin of processes
can be gathered, which can be used to
profile normal system behavior or to lo-
cate trends that may help in better sys-
tem administration.


Origin information may well be of benefit in
other security-related fields. The prospect
of access control in combination with origin 
information seems to be an especially inter- 
esting area. Research in that direction may 
well improve overall robustness of the origin 
mechanism itself. 


7 Conclusion 


In this paper, we have introduced the no-
tion of host causality as a mechanism to com-
plement current research in network trace-
back. With the addition of origin information
to a process, we have developed a mechanism
that, with only minor changes to the given
system, works well under simple circum-
stances. The two examples show that impor-
tant information for network traceback can be 
obtained with origin information and the new 
logging possibilities that result from that. 


The work we presented here is only the 
start of work in the overall area. We have 
identified many limitations of our mechanism, 
and outlined what future work needs to be 
done to better address the problem. Host
causality is not a complete solution to all the
problems that are faced in tracing connections
through a network, but the solutions it provides
could prove a valuable tool to help improve
security in a future networking environment.


Abstract


Cyclone is a safe dialect of C. It has been designed
from the ground up to prevent the buffer overflows,
format string attacks, and memory management er-
rors that are common in C programs, while retain-
ing C’s syntax and semantics. This paper examines
safety violations enabled by C’s design, and shows
how Cyclone avoids them, without giving up C’s
hallmark control over low-level details such as data
representation and memory management.


1 Introduction 


It is a commonly held belief in the security commu- 
nity that safety violations such as buffer overflows 
are unprofessional and even downright sloppy. This 
recent quote [33] is typical: 


Common errors that cause vulnerabilities 
— buffer overflows, poor handling of unex- 
pected types and amounts of data — are 
well understood. Unfortunately, features 
still seem to be valued more highly among 
manufacturers than reliability. 


The implication is that safety violations can be pre-
vented just by changing priorities.


It's true that highly trained and motivated program-
mers can produce extremely robust systems when
security is a top priority (witness OpenBSD). It's
also true that most programmers can and should do
more to ensure the safety and security of the pro-
grams that they write. However, we believe that the
reasons that safety violations show up so often in C
programs reach deeper than just poor training and
effort: they have their roots in the design of C itself.


Take buffer overflows, for example. Every introduc- 
tory C programming course warns against them and 
teaches techniques to avoid them, yet they continue 
to be announced in security bulletins every week. 
There are reasons for this that are more fundamen-
tal than poor training:


e One cause of buffer overflows in C is bad pointer 
arithmetic, and anthmcetic is tricky. To put it 
plainly, an off-by-one crror can cause a buffer 
overflow, and we will never be able to train pro- 
grammers to the point where off-by-one errors 
are completely eliminated. 


e C uses NUL-terminated strings. This is crucial 
for effidency (a buffer can be allocated once and 
used to hold many different strings of different 
lengths before deallocation), but there is always 
a danger of overwnting the NUL terminator, 
usually leading to a buffer overflow in a library 
function. Some library functions (strcat) have 
alternate versions (strncat) that help, by Iet- 
ting the programmer give a bound on the length 
of astring argument, but there arc many dozens 
of functions in POSIX with no such alternative. 


• Out-of-bounds pointers are commonplace in C.
  The standard way to iterate over the elements
  of an array is to start with a pointer to the first
  element and increment it until it is just past
  the end of the array. This is blessed by the
  C standard, which states that the address just
  past the end of any array must be valid. When
  out-of-bounds pointers are common, you have
  to expect that occasionally one will be
  dereferenced or assigned, causing a buffer overflow.
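The strncat remark above can be made concrete with a short C sketch. The helper name append_bounded and its return convention are assumptions made for illustration; only strncat and strlen are standard library calls. The sketch shows that the bound must be computed from the remaining capacity, which is exactly where an off-by-one error tends to creep in.

```c
#include <stddef.h>
#include <string.h>

/* Append src to the NUL-terminated string in dst, whose total capacity is
 * cap bytes. Returns 1 if all of src fit, 0 if it was truncated or if dst
 * holds no terminator within cap bytes. (Hypothetical helper.) */
int append_bounded(char *dst, size_t cap, const char *src)
{
    size_t used = 0;

    /* Find the current length without reading past cap bytes. */
    while (used < cap && dst[used] != '\0')
        used++;
    if (used == cap)
        return 0;                   /* no terminator inside the buffer */

    size_t room = cap - used - 1;   /* reserve one byte for the NUL */
    size_t need = strlen(src);

    strncat(dst, src, room);        /* copies at most room chars, then a NUL */
    return need <= room;
}
```

Forgetting the "- 1" in the computation of room would let strncat write its terminator one byte past the buffer, a small instance of the off-by-one problem described in the first bullet.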


In short, the design of the C programming language
encourages programming at the edge of safety. This
makes programs efficient but also vulnerable, and
leads us to conclude that safety violations are likely
to remain common in C programs. A number of studies
bear this out [23, 11, 28, 18].


If C programs are unsafe, it is tempting to suggest 
that all programs be written in a safe language like 
Java (or ML, or Modula-3, or even 40-year-old Lisp). 
However, this is not a realistic solution for everyone.
For one thing, it abandons legacy code. For another, 
all of the safe languages look very different from C: 
they are high-level and abstract, they do not have 
explicit memory management, and they do not give 
programmers control over low-level data representa- 
tions. These features make C unique, efficient, and 
indispensable to systems programmers. 


We are developing an alternative for those who want 
safety but do not want to switch to a high-level lan- 
guage: Cyclone, a dialect of C that has been de- 
signed to prevent safety violations. Our goal is to 
design Cyclone so that it has the safety guarantee 
of Java (no valid program can commit a safety vio- 
lation) while keeping C’s syntax, types, semantics, 
and idioms intact. In Cyclone, as in C, programmers
can “feel the bits.” We think that C programmers 
will have little trouble adapting to our dialect and 
will find Cyclone to be an appropriate language for 
many of the problems that ask for a C solution. 


Cyclone has been in development for two years. In 
total, we have written about 110,000 lines of Cy- 
clone code, with about 35,000 lines for the compiler 
itself, and 15,000 lines for supporting libraries and 
tools, like a port of the Bison parser generator. We 
have also ported about 50,000 lines of benchmark 
applications, and are developing a streaming media 
overlay network in Cyclone [27]. Cyclone is freely
available and comes with extensive documentation.
The compiler and most of the accompanying tools
are licensed under the GNU General Public License,
and most of the libraries are licensed under the GNU 
LGPL. 


This paper is a high-level overview of Cyclone. 
It presents the design philosophy behind Cyclone, 
gives an overview of the techniques we've used to 
make a safe version of C, and reviews the history 
of the project, the mistakes we've made, and the
course corrections that they inspired. 


The remainder of the paper is organized as follows.
Section 2 points out some of the features of C that
can lead to safety violations, and describes the
changes we made to prevent this in Cyclone. Section 3
gives some details about our implementation and its
performance. Section 4 discusses the evolution of
Cyclone’s design, pointing out key decisions that we
made and mistakes that we later reversed. We discuss
future work in Section 5. In Section 6, we discuss
existing approaches to making C safer, and explain
how Cyclone’s approach is different. We conclude in
Section 7.


2 From C to Cyclone 


Most of Cyclone’s language design comes directly 
from C. Cyclone uses the C preprocessor, and, with 
few exceptions, follows C’s lexical conventions and 
grammar. Cyclone has pointers, arrays, structures, 
unions, enumerations, and all of the usual floating 
point and integer types; and they have the same 
data representation in Cyclone as in C. Cyclone’s 
standard library supports a large (and growing) sub- 
set of POSIX. The intention is to make it easy for 
C programmers to learn Cyclone, to port C code to 
Cyclone, and to interface C code with Cyclone code. 


The major differences between Cyclone and C are all
related to safety. The Cyclone compiler performs a
static analysis on source code, and inserts run-time
checks into the compiled output at places where the
analysis cannot determine that an operation is safe.
The compiler may also refuse to compile a program.
This may be because the program is truly unsafe,
or may be because the static analysis is not able to
guarantee that the program is safe, even by inserting
run-time checks. We reject some programs that a C
compiler would happily compile: this includes all
of the unsafe C programs as well as some perfectly
safe programs. We must reject some safe programs,
because it is impossible to implement an analysis
that perfectly separates the safe programs from the
unsafe programs.


When Cyclone rejects a safe C program, the programmer
may choose to rewrite the program so that our analysis
can verify its safety. To make this easier, we have
identified common C idioms that our static analysis
cannot handle, and have added features to the language
so that these idioms can be programmed in Cyclone with
only a few modifications. These modifications typically
include adding annotations that supply hints to the
static analysis, or that cause the program to maintain
extra information needed for run-time checks (e.g.,
bounds checks).


Table 1: Restrictions imposed by Cyclone to preserve
safety

• NULL checks are inserted to prevent segmentation
  faults
• Pointer arithmetic is restricted
• Pointers must be initialized before use
• Dangling pointers are prevented through region
  analysis and limitations on free
• Only “safe” casts and unions are allowed
• goto into scopes is disallowed
• switch labels in different scopes are disallowed
• Pointer-returning functions must execute return
• setjmp and longjmp are not supported


Cyclone can thus be understood by starting from C,
imposing some restrictions to preserve safety, and
adding features to regain common programming idioms
in a safe way. Cyclone’s restrictions are summarized
in Table 1, and its extensions are summarized in
Table 2.


Some of the techniques we use to make Cyclone
safe have been applied to C before, and there has
been a great deal of research on additional techniques
that we do not use in Cyclone. However, previous
projects have typically used only one or two
techniques, resulting in incomplete coverage.
For example, McGary’s bounded pointers protect
against some, but not all, array access violations
[26], and StackGuard protects against some, but not
all, buffer overflows [9]. Our goal with Cyclone is
to prevent all safety violations. Moreover, previous
projects have been presented as optional add-ons to
C, so in practice they are seldom used in production
code; Cyclone makes safety the default.


In the rest of this section, we illustrate Cyclone’s
features by giving examples of safety violations in
C code, explaining how Cyclone’s restrictions detect
and prevent them, and introducing the language
extensions that can be used to safely program around
the restrictions. Some of the safety violations we
describe, like buffer overflows, can lead to root
exploits. All of them can lead to crashes, which can
be exploited to mount denial of service attacks
[6, 7, 12, 15, 25, 16].


Table 2: Extensions provided by Cyclone to safely
regain C programming idioms

• Never-NULL pointers do not require NULL checks
• “Fat” pointers support pointer arithmetic with
  run-time bounds checking
• Growable regions support a form of safe manual
  memory management
• Tagged unions support type-varying arguments
• Injections help automate the use of tagged
  unions for programmers
• Polymorphism replaces some uses of void *
• Varargs are implemented with fat pointers
• Exceptions replace some uses of setjmp and
  longjmp


NULL Consider the getc function: 


int getc(FILE *); 


If you call getc(NULL), what happens? The C standard
gives no definitive answer. If getc is written
with safety in mind, it will perform a NULL check on
its argument. That would be inefficient in the common
case, though, so the check is probably omitted,
leading to a segmentation fault.


Cyclone provides two solutions. The first is to
automatically insert run-time NULL checks when pointers
are used. For example, Cyclone will insert code into
the body of getc to do a NULL check when its argument
is dereferenced.


This requires little effort from the programmer, but
the NULL checks slow down getc. To repair this, we
have extended Cyclone with a new kind of pointer,
called a “never-NULL” pointer, and indicated with
‘@’ instead of ‘*’. For example, in Cyclone you can
declare


int getc(FILE @);


indicating that getc expects a non-NULL FILE
pointer as its argument. This one-character change
tells Cyclone that it does not need to insert NULL
checks into the body of getc. If getc is called with
a possibly-NULL pointer, Cyclone will insert a NULL
check at the call:


extern FILE *f; 


getc(f); // NULL check here 


Cyclone prints a warning when it inserts the NULL 
check. This can be suppressed with an explicit cast: 


getc((FILE @)f); // Check w/o warning 


A programmer can force the NULL check to occur
only once by declaring a new @-pointer variable, and
using the new variable at each call:


FILE @g = (FILE @)f; // NULL check here
getc(g);             // No NULL check


Finally, constants like stdin are declared as
@-pointers in the first place, and functions can be
declared to return @-pointers. The effect is that NULL
checks can be pushed back from their uses all the
way to their sources. This is just as in C, except
that in Cyclone, the compiler can ensure that NULL
dereferences do not occur.


Never-NULL pointers are a perfect example of Cyclone’s
design philosophy: safety is guaranteed, automatically
if possible, and the programmer has control over
where any needed checks are performed.
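The idea of pushing a NULL check back to its source can be imitated by convention in plain C, as in the following sketch. The function names count_bytes and count_bytes_nonnull are assumptions made for illustration. Unlike Cyclone's @-pointers, nothing in C enforces the convention; the compiler will not object if the inner routine is called with NULL.

```c
#include <stddef.h>
#include <stdio.h>

/* Inner routine: by convention f is never NULL here, so no check is made.
 * This plays the role of a Cyclone function declared with FILE @. */
static long count_bytes_nonnull(FILE *f)
{
    long n = 0;
    while (getc(f) != EOF)
        n++;
    return n;
}

/* Boundary routine: the single NULL check happens here, at the source.
 * Returns -1 for a NULL stream, otherwise the number of bytes read. */
long count_bytes(FILE *maybe_null)
{
    if (maybe_null == NULL)
        return -1;
    return count_bytes_nonnull(maybe_null);
}
```

The loop in count_bytes_nonnull calls getc many times but pays for the check only once, which is the effect the cast to FILE @ achieves in Cyclone.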


Buffer overflows To prevent buffer overflows, we
restrict pointer arithmetic: Cyclone does not permit
pointer arithmetic on *-pointers or @-pointers.
Instead, we provide another kind of pointer, indicated
by ‘?’, which permits pointer arithmetic. A
?-pointer is represented by an address plus bounds
information; since the representation of a ?-pointer
takes up more space than a *-pointer or @-pointer,
we call it a “fat” pointer. The extra information in
a fat pointer allows Cyclone to determine the size
of the array pointed to, and to insert bounds checks
at pointer accesses to ensure safety.


Here’s an example of fat pointers in use — the string 
length function written in Cyclone: 


General Track: 2002 USENIX Annual Technical Conference


int strlen(const char ?s) {
  int i, n;
  if (!s) return 0;
  n = s.size;
  for (i = 0; i < n; i++,s++)
    if (!*s) return i;
  return n;
}


This looks like a C version of strlen, with two
exceptions. First, we declare the argument s to be a
fat pointer to char, rather than a *-pointer. Second,
in the body of the function we are able to get
the size of the array pointed to by s, using the
notation s.size. This lets us check that s is in-bounds
in the for loop. That means we are guaranteed that
we will never dereference s outside the bounds of
the string, even if the NUL terminator is missing.
In contrast, the C strlen will scan past the end of
a string that lacks a NUL terminator.
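For comparison, here is a minimal C sketch of the same loop, under the assumption that the caller passes the array size by hand. The name bounded_strlen and the size parameter are hypothetical; size stands in for Cyclone's s.size, which a plain C pointer cannot supply.

```c
#include <stddef.h>

/* Length of the string at s, never reading more than size bytes.
 * size plays the role of Cyclone's s.size and must be supplied by the
 * caller, since a plain C pointer carries no bounds. */
size_t bounded_strlen(const char *s, size_t size)
{
    if (s == NULL)
        return 0;
    for (size_t i = 0; i < size; i++)
        if (s[i] == '\0')
            return i;
    return size;    /* no terminator found within the array */
}
```

The difference from the Cyclone version is that nothing ties size to the array that s actually points to; a wrong size argument brings the overflow back, whereas a fat pointer carries its bounds with it.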


Fat pointers add overhead to programs, because
they take up more space than other pointers, and
because of inserted bounds checks. However, they
ensure safety, they give the programmer new
capabilities (finding the size of the base array), and
the programmer has explicit control over where they
are used. It’s easy to use ?-pointers in Cyclone.
A programmer who wants to use a ?-pointer only
needs to change a single character (‘*’ to ‘?’) in
a declaration. Arrays and strings are converted to
?-pointers as necessary (automatically by the
compiler). A programmer can explicitly cast a ?-pointer
to a *-pointer (this inserts a bounds check) or to a
@-pointer (this inserts a NULL check and a bounds
check). A *-pointer or @-pointer can be cast to a
?-pointer, without any checks; the resulting ?-pointer
has size 1.


Uninitialized pointers The following snippet of
C crashed one author’s Palm Pilot: 


Form *f;
switch (event->eType) {
case frmOpenEvent:
  f = FrmGetActiveForm();
case ctlSelectEvent:
  i = FrmGetObjectIndex(f, field);
}


This is part of a function that processes events. The
problem is that while the pointer f is properly
initialized in the first case of the switch, it is (by
oversight) not initialized in the second case. So when
the function FrmGetObjectIndex dereferences f, it
isn’t accessing a valid pointer, but rather an
unpredictable address: whatever was on the stack when
the space for f was allocated.


To prevent this in Cyclone, we perform a static
analysis on the source code. The analysis detects that
f might be uninitialized in the second case, and the
compiler signals an error. Usually, this catches a
real bug, but there are times when our analysis isn’t
smart enough to figure out that something is properly
initialized. This may force the programmer to
initialize variables earlier than in C.


We don't consider it an error if non-pointers are
uninitialized. For example, if you declare a local
array of non-pointers, you can use it without
initializing the elements:


char buf [64]; // contains garbage .. 
sprintf(buf,"a"); // .. but no err here 
char c = buf[20]; // .. or even here 


This is common in C code; since these array accesses 
are in-bounds, we allow them. 


Dangling pointers Here is a naive (unsafe!) ver- 
sion of a C function that takes an int and returns 
its string representation: 


char *itoa(int i) {
  char buf[20];
  sprintf(buf,"%d",i);
  return buf;
}


The function allocates a character buffer on the 
stack, prints the int into the buffer, and returns a 
pointer to the buffer. The problem is that the caller 
now has a pointer into deallocated stack space: this 
can easily lead to safety violations. 


It is easy for a C compiler to warn against return- 
ing the address of a local variable, and, indeed, gcc 
prints just such a warning for the example above. 
However, this technique will not catch even the fol- 
lowing simple variation: 


char *itoa(int i) {
  char buf[20];
  char *z;
  sprintf(buf,"%d",i);
  z = buf;
  return z;
}


Here, the address of buf is stored in the variable
z, and then z is returned. This passes gcc -Wall
without complaint.


Cyclone prevents the dereference of dangling point- 
ers by performing a region analysis on the code. A 
region is a segment of memory that is deallocated 
all at once. For example, Cyclone considers all of 
the local variables of a block to be in the same re- 
gion, which is deallocated on exit from the block. 
Cyclone’s static region analysis keeps track of what 
region each pointer points into, and what regions 
are live at any point in the program. Any dereference
of a pointer into a non-live region is reported
as a compile-time error.


In this last example, Cyclone’s region analysis
knows that the address of buf is a pointer into the
local stack of itoa. The assignment to z tells Cyclone
that z is also a pointer into itoa’s stack area.
Since the local stack area will be deallocated when
z is returned from itoa, we report an error.
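One conventional way to repair itoa in plain C is to let the caller own the buffer, so that no pointer into the callee's stack frame escapes. The sketch below assumes a hypothetical name, itoa_into, and a return convention of 0 for success and -1 for a buffer that is too small; snprintf is the standard C99 function.

```c
#include <stddef.h>
#include <stdio.h>

/* Write the decimal representation of i into buf, which has room for cap
 * bytes including the terminator. Returns 0 on success and -1 if the
 * result did not fit. The buffer belongs to the caller, so it stays live
 * after this function returns. */
int itoa_into(char *buf, size_t cap, int i)
{
    int n = snprintf(buf, cap, "%d", i);
    if (n < 0 || (size_t)n >= cap)
        return -1;
    return 0;
}
```

In region terms, the caller's buffer lives in a region that outlives the call, which is the property Cyclone's analysis checks automatically.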


Cyclone’s region analysis is intraprocedural: it is
not a whole-program analysis. We rely on programmer
annotations to track regions across function
calls. For example, the strcat function is declared
as follows in Cyclone:


char ?`r strcat(char ?`r dest,
                const char ? src);


Here `r is a region variable. The declaration says
that for any region `r, strcat takes a pointer dest
into region `r, and a pointer src, and returns a
pointer into region `r. (In fact, the C standard
specifies that strcat returns dest.) This information
enables Cyclone to correctly reject the following
program:


char ?itoa(int i) {
  char buf[20];
  sprintf(buf,"%d",i);
  return strcat(buf, "");
}




The region analysis deduces that the result of the 
call to strcat on buf points into the local stack 
region of itoa, so it cannot be returned from the 
function. 


Cyclone’s region analysis is described in greater
detail in a separate paper [21].


Free C’s free function can create dangling pointers,
and, depending on how it is implemented, can
cause segmentation faults or even root compromises 
if used incorrectly (e.g., if it is called with a pointer 
not returned by malloc [16], or if it is used to re- 
claim the same block of memory twice [7]). It is 
difficult to design an analysis that can guarantee 
the correct use of pointers and free, so our current 
solution is drastic: we make free a no-op. 


Obviously, programmers still need a way to reclaim 
heap-allocated data. We provide two ways. First, 
the programmer can use an optional garbage collec- 
tor. This is very helpful in getting existing C pro- 
grams to port to Cyclone without many changes. 
However, in many cases it constitutes an unaccept- 
able loss of control. 


We recognize that C programmers need explicit
control over allocation and deallocation. Therefore,
Cyclone provides a feature called growable regions. 
The following code declares a growable region, does 
some allocation into the region, and deallocates the 
region: 


region h {
  int *x = rmalloc(h,sizeof(int));
  int ?y = rnew(h) { 1, 2, 3 };
  char ?z = rprintf(h,"hello");
}


The code uses a region block to start a new, grow- 
able region that lives on the heap. The region is 
deallocated on exit from the block (without an ex- 
plicit free). The variable h is a handle for the re- 
gion and it is used to allocate into the region, in one 
of several ways. 


First, there is an rmalloc construct that behaves 
like malloc except that it requires a region handle 
as an argument; it allocates into the region of the 
handle. In the example above, x is initialized with a 
pointer to an int-sized chunk of memory allocated 
in h’s region.


Second, the rnew construct is used when the programmer
wants to allocate and initialize in a single
step. For example, y is initialized above as a fat
pointer to an array with elements 1, 2, and 3,
allocated in h’s region.


Finally, region handles may be passed to functions
like the library function rprintf. rprintf is like
sprintf, except that it does not print to a fixed- 
sized buffer; instead it allocates a buffer in a region, 
places the formatted output in the buffer, and re- 
turns a pointer to the buffer. In the example above, 
z is initialized with a pointer to the string “hello” 
that is allocated in h’s region. Unlike sprintf, there 
is no risk of a buffer overflow, and unlike snprintf, 
there is no risk of passing a buffer that is too small. 
Moreover, the allocated buffer will be freed when the 
region goes out of scope, just as a stack-allocated 
buffer would be. 


Our region analysis knows that x, y, and z all point
into h’s region, and that the region is deallocated
on exit from the block. It uses this knowledge to
prevent dangling pointers into the region. For
example, it prohibits storing x into a global variable,
which could be used to (wrongly) access the region
after it is deallocated.


Growable regions are a safe version of arena-style 
memory management, which is widely used (e.g., 
in Apache). C programmers use many other styles 
of memory management, and we plan in the future 
to extend Cyclone to accommodate more of them 
safely. In the meantime, Cyclone is one of the very 
few safe languages that supports safe, explicit mem- 
ory management, without relying on a garbage col- 
lector. 
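The arena style that growable regions make safe can be sketched in a few lines of C. All names here (struct region, region_init, region_alloc, region_free_all) are assumptions made for illustration and are not Cyclone's runtime interface. The sketch also lacks the point of the Cyclone feature: nothing stops a pointer into the arena from being used after region_free_all.

```c
#include <stddef.h>
#include <stdlib.h>

/* Header placed in front of every allocation. The union forces the payload
 * that follows it to be suitably aligned for any object type. */
union block {
    union block *next;
    max_align_t align;
};

/* A region is a linked list of the blocks allocated into it. */
struct region {
    union block *head;
};

void region_init(struct region *r)
{
    r->head = NULL;
}

/* Allocate n bytes that live until region_free_all is called on r.
 * Returns NULL if the underlying malloc fails. */
void *region_alloc(struct region *r, size_t n)
{
    union block *b = malloc(sizeof *b + n);
    if (b == NULL)
        return NULL;
    b->next = r->head;
    r->head = b;
    return b + 1;       /* payload starts right after the header */
}

/* Deallocate everything in the region at once. */
void region_free_all(struct region *r)
{
    union block *b = r->head;
    while (b != NULL) {
        union block *next = b->next;
        free(b);
        b = next;
    }
    r->head = NULL;
}
```

A caller would pair region_init and region_free_all around a block of work, much as the region h { ... } block does, except that in Cyclone the deallocation is implicit and the compiler rejects pointers that escape the block.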


Type-varying arguments In C it is possible to
write a function that takes an argument whose type
varies from call to call. The printf function is a
familiar example:


printf("%d", 3); printf("%s", "hello");


In the first call to printf, the second argument is
an int, and in the next call, the second argument
is a char *. This is perfectly safe in this case, and
the compiler can even catch errors by examining
the format string to see what types the remaining
arguments should have. Unfortunately, the compiler
can’t catch all errors. Consider:


extern char *y; printf(y);


This is a lazy way to print the string y. The problem 
is that, in general, y can contain % format directives, 
causing printf to look for non-existent arguments 
on the stack. The compiler can’t check this because 
y is not a string literal. A core dump is not unlikely. 


The danger is greater if the user of the program 
gets to choose the string y. The %n format directive
causes printf to write the number of characters 
printed so far into a location specified by a pointer 
argument; it can be used to write an arbitrary value 
to a location chosen by the attacker, leading to a 
complete compromise. This is known as a format 
string attack, and it is an increasingly common ex- 
ploit [34]. 
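In plain C, the usual defence is to pass untrusted text as data behind a fixed "%s" format rather than as the format string itself. The following sketch uses a hypothetical helper name, format_untrusted, and writes into a buffer so that the result is easy to inspect.

```c
#include <stddef.h>
#include <stdio.h>

/* Copy untrusted text into out (capacity cap) as data, not as a format.
 * Returns what snprintf returns: the length the full output would have. */
int format_untrusted(char *out, size_t cap, const char *untrusted)
{
    /* Unsafe alternative (do not use): snprintf(out, cap, untrusted);
     * any % directives in untrusted would then be interpreted. */
    return snprintf(out, cap, "%s", untrusted);
}
```

With the fixed format, a string such as "%n%s%x" is copied verbatim and no directive in it is interpreted. This relies on programmer discipline, which is the weakness the Cyclone approach described next is meant to remove.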


We solve this in Cyclone in two steps. First, we add 
tagged unions to the language: 


tunion t {
  Int(int);
  Str(char ?);
};


This declares a new tagged union type, tunion t. 
A tagged union has several cases, like an ordinary 
union, but adds tags that distinguish the cases. 
Here, tunion t has an int case with tag Int, and 
a char ? case with tag Str. A function that takes 
a tagged union as argument can look at the tags 
to find out what case the argument is in, using an 
extension of the switch statement: 


void pr(tunion t x) {
  switch (x) {
  case &Int(i): printf("%d",i); break;
  case &Str(s): printf("%s",s); break;
  }
}


The first case of the switch will be executed if x 
has tag Int; the variable i gets bound to the un- 
derlying int, so it can be used in the body of the 
case. Similarly, the second case is taken if x has tag 
Str with underlying string s. 


Tags enable the pr function above to correctly de- 
tect the type of its argument. However, callers have 
to explicitly add tags to the arguments. For exam- 
ple, pr can be called as follows: 


pr(new Int(4)); 
pr(new Str("hello")); 


The first line calls pr with the int 4, adding the tag 
Int with the notation new Int(4). The second call 
does the same with string “hello” and tag Str. 
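A rough C analogue of tunion t pairs an ordinary union with an explicit tag. The type and function names below (enum tag, struct tagged, make_int, make_str) are assumptions made for illustration, and pr writes into a buffer rather than printing so that its result can be checked. Unlike Cyclone, C does not stop code from reading the wrong union member; the tag is honoured only by convention.

```c
#include <stddef.h>
#include <stdio.h>

enum tag { TAG_INT, TAG_STR };

/* A value together with a tag that says which union member is valid. */
struct tagged {
    enum tag tag;
    union {
        int i;
        const char *s;
    } u;
};

/* Constructors, playing the role of new Int(4) and new Str("hello"). */
struct tagged make_int(int i)
{
    struct tagged t;
    t.tag = TAG_INT;
    t.u.i = i;
    return t;
}

struct tagged make_str(const char *s)
{
    struct tagged t;
    t.tag = TAG_STR;
    t.u.s = s;
    return t;
}

/* Format x into out according to its tag. Returns the snprintf result,
 * or -1 if the tag is not one of the known cases. */
int pr(char *out, size_t cap, struct tagged x)
{
    switch (x.tag) {
    case TAG_INT: return snprintf(out, cap, "%d", x.u.i);
    case TAG_STR: return snprintf(out, cap, "%s", x.u.s);
    }
    return -1;
}
```

The call pr(buf, sizeof buf, make_int(4)) corresponds to pr(new Int(4)) in the Cyclone example.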


Inserting the tags by hand is inconvenient, so we also 
provide a second feature, automatic tag injection. 
For example, in Cyclone, printf is declared 


printf(char ?fmt, ... inject parg_t);


where parg_t is a tagged union containing all of the 
possible types of arguments for printf. Cyclone’s 
printf is called just as in C, without explicit tags: 


printf("%s %i", "hello", 4);


The compiler inserts the correct tags automatically
(they are placed on the stack). The printf function
itself accesses the tagged arguments through a
fat pointer (Cyclone’s varargs are bounds checked)
and uses switch to make sure the arguments have
the right type. This makes printf safe even if the
format string argument comes from user input:
Cyclone does not permit the printf programmer
to use the arguments in a type-inconsistent way.
Moreover, the tags let the programmer detect any
inconsistency at run time and take appropriate
action (e.g., return an error code or exit the program).


Type-varying arguments are used in many other 
POSIX functions, including the scanf functions, 
fcntl, ioctl, signal, and socket functions such
as bind and connect. Cyclone uses tagged unions 
and injection to make sure that these functions are 
called safely, while presenting the programmer with 
the same interface as in C. 


Goto C’s goto statements can lead to safety vi- 
olations when they are used to jump into scopes. 
Here is a simple example: 


int z;
{ int x = 0xBAD; goto L; }
{ int *y = &z;
  L: *y = 3; // Possible segfault
}


The program declares a variable z, then enters two
blocks in sequence. Many compilers stack allocate
the local variables of a block when it is entered, and
deallocate (pop) the storage when the block exits
(though this is not mandated by the C standard).
If the example is compiled in this way, then when
the program enters the first block, space for x is
allocated on the stack, and is initialized with the value
0xBAD. The goto jumps into the middle of the second
block, directly to the assignment to the contents
of the pointer y. Since y is the first (only) variable
declared in the second block, the assignment expects
y to be at the top of the stack. Unfortunately, that’s
exactly where x was allocated, so the program tries
to write to location 0xBAD, probably triggering a
segmentation fault.


Cyclone’s static analysis detects this situation and
signals an error. A goto that does not enter a scope 
is safe, and is allowed in Cyclone. We apply the 
same analysis to switch statements, which suffer 
from a similar vulnerability in C. 


Other vulnerabilities These are only a few of 
the features of C that can be misused to cause safety
violations. Other examples are: bad casts; varargs 
(as implemented in C); missing return statements; 
violations of const qualifiers; and improper use of 
unions. Cyclone’s analysis restricts these features 
to prevent safety violations. 


3 Implementation 


The Cyclone compiler is implemented in approximately
35,000 lines of Cyclone. It consists of a
parser, a static analysis phase, and a simple
translator to C. We use gcc as a back end and have also
experimented with using Microsoft Visual C++. We
are able to use some existing tools (gdb, flex) and
we ported others completely to Cyclone (bison).
When a user compiles with garbage collection
enabled, we use the Boehm-Demers-Weiser conservative
garbage collector as an off-the-shelf component.
We have also built some useful utilities, including a
documentation generation tool and a memory
profiler.


In order to get a rough idea of the current and
potential performance of the language, we ported a
selection of benchmarks from C to Cyclone. The
benchmarks were useful in testing Cyclone’s safety
guarantees as well as its performance: several of the
benchmarks had safety violations that were revealed
(and that we subsequently fixed) when we ported them
to Cyclone. The process of porting also tested the
limitations of Cyclone’s interface to the C library
and forced us to provide more complete library
support. For example, even small benchmarks such as
finger and http_get make use of parts of the C
library that the Cyclone compiler and other tools
do not, such as sockets and signals.


Table 3: Benchmark diffs

[Table body lost in extraction. Columns, as described
in the text: lines of code in C and in Cyclone, diff #,
C %, and ? %. Benchmarks listed: cacm, cfrac, finger,
grobner, http_get, http_load, http_ping, http_post,
matxmult, mini_httpd, ncompress, tile. Surviving
totals: 18627 | 18847 for the first grouping; 7223,
7178, and 1034 for the regionized benchmarks.]


The benchmarks We tried to pick benchmarks
from a range of problem domains. For networking,
we used the mini_httpd web server; the web
utilities http_get, http_post, http_ping, and
http_load; and finger. The cfrac, grobner,
tile, and matxmult benchmarks are computationally
intensive C applications that make heavy use of
arrays and pointers. Finally, cacm and ncompress
are compression utilities. All of the benchmark
programs, in both C and Cyclone, can be found on the
Cyclone homepage [10].


Ease of porting We have tried to design Cyclone
so that existing C code can be ported with few
modifications. Table 3 quantifies the number of
modifications we needed to port the benchmarks. For each
benchmark, the table shows the number of lines of
code in both the C and Cyclone versions. The diff #
column shows the number of lines changed in each
port, and the C % column shows the percentage of
lines changed relative to the original program size.
In porting the first grouping of benchmarks, we tried
to minimize changes. In particular, the benchmarks
involving non-trivial dynamic memory management
(cfrac, grobner, http_load, and tile) were compiled
with the garbage collector in Cyclone; all other
benchmarks do not use the garbage collector. The
second grouping gives results for versions of
benchmarks that we modified to make use of Cyclone’s
growable regions wherever possible.


Usually fewer than 10% of the lines needed to
be changed to port the benchmarks to Cyclone.
One of the most common changes was changing
C-style * pointers to Cyclone ? pointers; for example,
changing char * to char ?. The ? % column
of Table 3 shows the percentage of changes
that were of this form: generally, this simple change
accounted for 20-50% of changed lines. Most of
the other changes had to do with adapting to
Cyclone’s stricter requirements for allocation,
initialization, const enforcement, and function
prototyping. Typical changes of these forms included
changing malloc to new, adding explicit initializers,
adding explicit const type qualifiers to casts, and
ensuring that all functions have prototypes with
explicit return values.


Performance Table 4 compares the performance
of the benchmarks in C, in Cyclone with bounds
checking enabled, and in Cyclone with bounds
checking disabled. Presently we do only very simple
bounds-check elimination, because our effort to
date has focused on safety, rather than performance;
the gap between the second and third measurements
gives an upper bound for the improvement we can
expect from this in the future.


We ran each benchmark twenty-one times on a 750
MHz Pentium III with 256MB of RAM, running
Linux kernel 2.2.16-12, using gcc 2.96 as a back
end. We used the gcc flags -O3 and -march=i686
for compiling all the benchmarks. Because we
observed skewed distributions for the http
benchmarks, we report medians and semi-interquartile
ranges (SIQR).¹ For the non-web benchmarks (and
some of the web benchmarks as well) the median
and the mean were essentially identical, and the
standard deviation was at most 2% of the mean.


¹ The semi-interquartile range is the difference between the
high quartile and the low quartile divided by 2. This is a
measure of variability, similar to standard deviation,
recommended for skewed distributions [22].
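As a worked illustration of the semi-interquartile range, the following C sketch sorts a sample and returns half the distance between its upper and lower quartiles. The function names are hypothetical, and the quartile convention (linear interpolation between neighbouring order statistics) is an assumption, since the text does not say which convention was used for the measurements.

```c
#include <stddef.h>
#include <stdlib.h>

/* Comparison callback for qsort on doubles. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Quantile q (0 <= q <= 1) of a sorted array with n >= 1 elements, using
 * linear interpolation between adjacent order statistics (an assumed
 * convention). */
static double quantile_sorted(const double *v, size_t n, double q)
{
    double pos = q * (double)(n - 1);
    size_t lo = (size_t)pos;
    size_t hi = (lo + 1 < n) ? lo + 1 : lo;
    double frac = pos - (double)lo;
    return v[lo] + frac * (v[hi] - v[lo]);
}

/* Semi-interquartile range: (upper quartile - lower quartile) / 2.
 * Sorts v in place. Returns 0 for an empty sample. */
double siqr(double *v, size_t n)
{
    if (n == 0)
        return 0.0;
    qsort(v, n, sizeof *v, cmp_double);
    return (quantile_sorted(v, n, 0.75) - quantile_sorted(v, n, 0.25)) / 2.0;
}
```

With twenty-one runs, this convention places the quartiles exactly on the 6th and 16th sorted values, so no interpolation is needed.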


The table also shows the slowdown factor of Cyclone
relative to C. We achieve near-zero overhead for 
I/O bound applications such as the web server and 
the http programs, but there is a considerable over- 
head for computationally-intensive benchmarks; the 
worst is grobner, which is almost a factor of three 
slower than the C version. We have seen slowdowns 
of a factor of six in pathological scenarios involving 
pointer arithmetic in other microbenchmarks not 
listed here. 


Two common sources of overhead in safe languages 
are garbage collection and bounds checking. The 
checked and unchecked columns of Table 4 show 
that bounds checks are an important component of 
our overhead, as expected. Garbage collection over- 
head is not as easy to measure. Profiling the garbage 
collected version of cfrac suggests that garbage col- 
lection accounts for approximately half of its over- 
head. Partially regionizing cfrac resulted in a 6% 
improvement with bounds checks on; but regioniz- 
ing can require significant changes to the program, 
so the value of this comparison is not clear. We 
expect that the overhead will vary widely for dif- 
ferent programs depending on their memory usage 
patterns; for example, http_load and tile make 
relatively little use of dynamic allocation, so they 
have almost no garbage collection overhead. 


Cyclone's representation of fat pointers turned out 
to be another important overhead. We represent 
fat pointers with three words: the base address, the 
bounds address, and the current pointer location
(essentially the same representation used by
McGary’s bounded pointers [26]). Compared to C’s
pointers, fat pointers have a larger space overhead, 
larger cache footprint, increased parameter passing 
overhead, and increased register pressure, especially 
on the register-impoverished x86. Good code gen- 
eration can make a big difference: we found that 
using gcc’s -march=i686 flag increased the speed 
of programs making heavy use of fat pointers (such 
as cfrac and grobner) by as much as a factor of
two, because it causes gcc to use a more efficient 
implementation of block copy. 
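The three-word representation can be pictured with a small sketch. This is not Cyclone's generated code: FatIntPtr, make_fat, advance, and deref are hypothetical names, and the exception stands in for whatever action a failed bounds check takes.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical fat pointer to int: three machine words.
struct FatIntPtr {
    int* base;   // first element of the underlying array
    int* bound;  // one past the last element
    int* cur;    // current pointer location
};

FatIntPtr make_fat(int* array, std::size_t n) {
    return {array, array + n, array};
}

// Pointer arithmetic only moves cur; no check is needed here.
FatIntPtr advance(FatIntPtr p, std::ptrdiff_t delta) {
    p.cur += delta;
    return p;
}

// The bounds check happens at dereference time.
int& deref(FatIntPtr p) {
    if (p.cur < p.base || p.cur >= p.bound) {
        throw std::out_of_range("fat pointer out of bounds");
    }
    return *p.cur;
}
```

Every FatIntPtr passed to a function occupies three words instead of one, which is the source of the extra parameter-passing cost and register pressure described above.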


                     C time (s)     Cyclone time
Test                                checked (s)    unchecked (s)
cacm                 0.12 ± 0.00    0.15 ± 0.00    0.14 ± 0.00
cfrac†               2.30 ± 0.00    9.07 ± 0.01    4.77 ± 0.01
finger               0.54 ± 0.42    0.48 ± 0.15    0.53 ± 0.16
grobner†             0.03 ± 0.00    0.07 ± 0.00    0.07 ± 0.00
http_get             0.32 ± 0.03    0.33 ± 0.02    0.32 ± 0.06
http_load†           0.16 ± 0.00    0.16 ± 0.00    0.16 ± 0.00
http_ping            0.06 ± 0.02    0.06 ± 0.02    0.06 ± 0.01
http_post            0.04 ± 0.01    0.04 ± 0.00    0.04 ± 0.01
matxmult             1.37 ± 0.00    1.50 ± 0.00    1.37 ± 0.00
mini_httpd-1.15c     2.05 ± 0.00    2.09 ± 0.00    2.09 ± 0.00
ncompress-4.2.4      0.14 ± 0.01    0.19 ± 0.00    0.18 ± 0.00
tile†                0.44 ± 0.00    0.74 ± 0.00    0.67 ± 0.00

† Compiled with the garbage collector
(The slowdown-factor columns for the rows above are missing from this copy.)

Regionized benchmarks
                     C time (s)     checked (s)    factor    unchecked (s)   factor
cfrac                2.30 ± 0.00    5.22 ± 0.01    2.27x     4.55 ± 0.00     1.98x
mini_httpd-1.15c     2.05 ± 0.00    2.09 ± 0.00    1.02x     2.08 ± 0.00     1.01x

Table 4: Benchmark performance


Safety. We found array bounds violations in three
benchmarks when we ported them from C to Cyclone:
mini_httpd, grobner, and tile. This was a
surprise, since at least one (grobner) dates back to
the mid 1980s. On the other hand, this is consistent
with research that shows that such bugs can linger
for years even in widely used software [28].


The mini_httpd web server consults a file,
.htpasswd, to decide whether to grant client access
to protected web pages. It tries to be careful not
to reveal the password file to clients. Ironically, the
code to protect the password file contains a safety
violation:


#define AUTH_FILE ".htpasswd"
... strcmp(&(file[strlen(file) -
                 sizeof(AUTH_FILE) + 1]),
           AUTH_FILE) == 0 ...


The code is trying to see if the file requested by
the client is .htpasswd. Unfortunately, if file is a
string shorter than .htpasswd, then strcmp will be
passed an out-of-bounds pointer. This could result
in access to file being denied (if the region of memory
just before the string constant ".htpasswd"
happens to contain that file name), or it could cause
the program to crash (if the region of memory is inaccessible).
Cyclone found the error with a run-time
bounds check.
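One way to repair the comparison is to test the length before computing the index. The sketch below is not the fix applied to mini_httpd; the helper name names_auth_file is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

#define AUTH_FILE ".htpasswd"

// Returns true when file ends with AUTH_FILE. The length test comes
// first, so the offset computed below can never be negative.
bool names_auth_file(const char* file) {
    std::size_t len = std::strlen(file);
    std::size_t suffix_len = sizeof(AUTH_FILE) - 1;  // exclude the NUL
    if (len < suffix_len) {
        return false;  // too short to contain the suffix
    }
    return std::strcmp(file + (len - suffix_len), AUTH_FILE) == 0;
}
```

For strings at least as long as the suffix, the offset len - suffix_len is the same index as strlen(file) - sizeof(AUTH_FILE) + 1 in the original expression; only the short-string case changes.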


The grobner benchmark had a more serious violation
affecting both safety and correctness. The
program represents polynomials as arrays of coefficients,
and has a multiply routine that handles polynomials
with a single coefficient as a special case.
Unfortunately, the code for the general case turns
out to be completely wrong: a loop is unrolled incorrectly,
and the multiplication ends up being applied
to out-of-bounds pointers. As a result, the
answers returned are unpredictable. Four of the ten
test cases provided in the distribution follow this
code path (in our performance experiments above,
we consider only the six correct input cases). In
Cyclone, our bounds checks quickly illuminated the
source of the problem.


General Track: 2002 USENIX Annual Technical Conference


The tile program had array bounds violations due
to an off-by-one error and an order-of-evaluation 
bug in this code: 


if (snum > cur_sentsize)
    mksentarrays(cur_sentsize,
                 cur_sentsize += GROWSENT);


The function mksentarrays reallocates several 
global arrays. Reallocation is supposed to occur 
when snum is greater than or equal to cur_sentsize; 
the if guard above has an off-by-one error. Cyclone 
caught this with a bounds check in mksentarrays. 
In addition, the first argument of mksentarrays 
should be the old size of the array, and the second 
argument should be the new size. Our platform uses 
right-to-left evaluation, so the code above passes the
new size of the array to mksentarrays in both arguments.
Again, this was caught with a bounds check
in Cyclone. In C, the out-of-bounds access was not
caught, causing an incorrect initialization of the new
arrays.
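Both problems go away if the guard uses >= and the old size is read in its own statement before the size is updated. The sketch below is only an illustration: grow_array stands in for mksentarrays, and a single std::vector stands in for the several global arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const std::size_t GROWSENT = 4;  // hypothetical growth increment

// Stand-in for mksentarrays: grows the array and initializes only
// the slots between the old size and the new size.
void grow_array(std::vector<int>& a, std::size_t old_size,
                std::size_t new_size) {
    a.resize(new_size);
    for (std::size_t i = old_size; i < new_size; ++i) {
        a[i] = -1;
    }
}

void ensure_capacity(std::vector<int>& a, std::size_t& cur_size,
                     std::size_t snum) {
    if (snum >= cur_size) {               // >= fixes the off-by-one
        std::size_t old_size = cur_size;  // read the old size first
        cur_size += GROWSENT;             // then update, separately
        grow_array(a, old_size, cur_size);
    }
}
```

Because the two sizes are now computed in separate statements, the result no longer depends on the order in which the compiler evaluates function arguments.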


4 Design history 


Cyclone began as an offshoot of the Typed Assem- 
bly Language (TAL) project [30, 20]. The TAL 
project’s goal was to ensure program safety at the 
machine code level, by adding machine-checkable 
safety annotations to machine code. The machine 
code annotations are not easy to produce by hand, 
so we designed a simple, C-like language called Pop- 
corn as a front end, and built a compiler that auto- 
matically translates Popcorn to machine code plus 
the necessary annotations. 


Popcorn worked out well as a proof-of-concept for 
TAL, but it had some disadvantages. It was C-like, 
but different enough to make porting C code and in- 
terfacing to C code difficult. It was also a language 
that was used only by our own research group, and 
was unlikely to be adopted by anyone else. Cyclone 
is a reworking of Popcorn with two agendas: to fur- 
ther our understanding of low-level safety, and to 
gain outside adopters. 


It turns out that taking C compatibility as a seri- 
ous requirement was critical to advancing both of 
these agendas. It was obvious from the start that 
C compatibility would make Cyclone more appeal- 
ing to others, but the idea that it would help us 
to understand how to better design a safe low-level 
language was a surprise.


C programmers don’t write the same kinds of pro- 
grams as programmers in safe languages like Java — 
they use many tricks that aren’t available in high- 
level languages. While many C programs are not 
100% safe, most are intended to be safe, and we 
learned a great deal from porting systems code from 
C to Cyclone. Often, we found that we had made 
choices in the design of Cyclone that were holdovers 
from ML [29], another language that we had worked 
on. Some (most!) of these choices were right for ML, 
but not for C, or for Cyclone, and we ended up
following C more closely than we had expected at the
start.


All of this has played out gradually over the years of
Cyclone’s development. Here are some of the more
notable mistakes and course changes we’ve made:


• Originally, we supported arrays not with fat
pointers, but with a type array<t>, where t
is the element type of the array. An array<t>
could be passed to functions, and a value of
type array<t> supported subscripting, but not
pointer arithmetic. This matches up closely
with ML’s array types, and was a carryover
from when Popcorn was implemented in ML.
However, converting C code to use array<t>
was painful, requiring nontrivial editing of type
declarations, and converting pointer arithmetic
to array subscripting. We abandoned it for fat
pointers, which make it easy to port C code,
requiring only a few changes from ‘*’ to ‘?’, and
no changes to pointer arithmetic.


• We didn’t understand the importance of
NUL-terminated strings. NUL termination isn’t
guaranteed in C, so, for safety, we were com- 
mitted to using explicit array bounds from the 
beginning. The NUL seemed pointless, and 
our first string library ignored it. As we pro- 
grammed more in the language and ported C 
code, we came to understand how important 
NUL is to efficiency (memory reuse), and we 
changed our string library to match up with 
C’s. 


• In C, a switch case by default falls through to
the next case, unless there is an explicit break. 
This is exactly the opposite of what it should 
be: most cases do not fall through, and, more- 
over, when a case does fall through, it is prob- 
ably a bug. Therefore, we added an explicit 
fallthru statement, and used the rule that a 
case would not fall through unless the fallthru 
statement was used. 


Our decision to “correct” C’s mistake was 
wrong. It made porting error-prone because 
we had to examine every switch statement to 
look for intentional fall throughs, and add a 
fallthru statement. We had also gotten rid of 
any special meaning of break within switch, 
since it was no longer needed — consequently, 
a break in a switch within a loop would break 
to the head of the loop (in early versions of Cy- 
clone). Eventually, we realized that we were 
going against a basic instinct of every C pro- 
grammer, without gaining much of anything, so 
we restored C’s semantics of switch and break.


• We originally implemented tagged unions as an
extension of enumerations, since an enumeration
constant is like a case of a tagged union
with no associated value. Since a tagged union
is more general, we decided to just have one of
the two.


This was a mistake because in C, an enumeration
is really treated as int, and C programmers
rely on this. It’s not uncommon to see
things like

x = (x+1)%3;

where x is an enumeration variable. We aren't
able to do this with tagged unions, so we eventually
separated them from enum.


5 Future work 


C programmers use a wide variety of memory man- 
agement strategies, but at the moment, Cyclone 
supports only garbage collection and arena mem- 
ory management. A major goal of the project going 
forward will be to research ways to accommodate 
other memory management strategies, while retain- 
ing safety. 


Another limitation of our current release is that as- 
signments to fat pointers are not atomic, and hence, 
are not thread-safe. We plan to address this by re- 
quiring the programmer to acquire a lock before ac-
cessing a thread-shared fat pointer; this will be en- 
forced by an extension of the type system. Locks 
will not be necessary for thread-local fat pointers. 


We are experimenting with a number of new pointer 
representations. For instance, a pointer to a zero- 
terminated array can be safely represented as just an 
address, as long as the pointer only moves forward 
inside the array, and the zero terminator is never 
overwritten. The new representations should make
it easier to interface to legacy C code as well as 
improve on the space overhead of fat pointers. 


Finally, we plan to explore ways to automatically 
translate C programs into Cyclone. We have the 
beginnings of this in the compiler itself (which tries 
to report informative errors at places where code 
needs to be modified), and in a tool we built to 
semi-automatically construct a Cyclone interface to 
C libraries.


6 Related work


There is an enormous body of research on making 
C safer. Most techniques can be grouped into one 
of the following strategies: 


1. Static analysis. Programs like Lint crawl over C 
source code and flag possible safety violations, 
which the programmer can then review. Some 
other examples are LCLint [17, 24], Metal [13,
14], SLAM [3, 2], PREfix [5], and cqual [32].


2. Inserting run-time checks. C’s assert statements,
the Safe-C system [1], and “debugging”
versions of libraries, like Electric Fence, cause 
programs to perform sanity checks as they run. 
This technique has been used to combat buffer 
overflows [9, 4, 19] and printf format string 
attacks [8]. 


3. Combining static analysis and run-time checks. 
Systems like CCured [31] perform static analyses
to check source code for safety, and auto-
matically insert run-time checks where safety 
cannot be guaranteed statically. 


These are good techniques, and Cyclone itself uses the
third strategy. However, except for CCured, none of
the above projects applies them in a way that comes
close to ruling out all of the safety violations found
in C. It is not hard for a program to pass Lint
and still crash, and even the more advanced checking
systems, like LCLint, SLAM, and Metal, do not
find all safety violations. We can say something similar
about all of the other systems mentioned above.
Furthermore, most of these systems are simply not
used; assert is probably the most popular, but it
is usually turned off when code is shipped to avoid
performance degradation.


CCured and Cyclone both seek to rule out all
safety violations. The main disadvantage of CCured 
is that it takes control away from programmers. 
CCured needs to maintain some extra bookkeeping 
information in order to perform necessary run-time 
checks, and it does this by modifying data repre- 
sentations. For example, an int * might be rep- 
resented by just an address, but it might also be 
represented by an address plus extra data that al- 
lows bounds checking. This means that CCured 
has control over data representations, not the pro- 
grammer; and, moreover, basic operations (derefer- 
encing, pointer arithmetic) will have different costs,
depending on the decisions made by CCured. Furthermore,
CCured relies on a garbage collector, so
programmers have less control over memory management.
All of these decisions were made because
CCured is most concerned with porting legacy code 
with little or no change; Cyclone is concerned with 
preserving C's hallmark control over low-level de- 
tails such as data representation and memory man- 
agement, both when porting old code and writing 
new code. 


7 Conclusion 


Cyclone is a C dialect that prevents safety violations
in programs using a combination of static analyses
and inserted run-time checks. Cyclone’s goal is to
accommodate C’s style of low-level programming,
while providing the same level of safety guaranteed
by high-level safe languages like Java, a level of
safety that has not been achieved by previous approaches.


Abstract


Cooperative task management can provide program architects
with ease of reasoning about concurrency issues.
This property is often espoused by those who
recommend “event-driven” programming over “multi- 
threaded” programming. Those terms conflate several 
issues. In this paper, we clarify the issues, and show how 
one can get the best of both worlds: reason more simply 
about concurrency in the way “event-driven” advocates 
recommend, while preserving the readability and main- 
tainability of code associated with “multithreaded” pro- 
gramming. 


We identify the source of confusion about the two pro- 
gramming styles as a conflation of two concepts: task 
management and stack management. Those two con- 
cerns define a two-axis space in which “multithreaded”
and “event-driven” programming are diagonally oppo-
site; there is a third “sweet spot” in the space that com-
bines the advantages of both programming styles. We 
point out pitfalls in both alternative forms of stack man- 
agement, manual and automatic, and we supply tech- 
niques that mitigate the danger in the automatic case. 
Finally, we exhibit adaptors that enable automatic stack 
management code and manual stack management code 
to interoperate in the same code base. 


1 Introduction 


Our team embarked on a new project and faced the question
of what programming model to use. Each team
member had been burned by concurrency issues in the
past, encountering bugs that were difficult to even reproduce,
much less identify and remove. We chose to follow
the collective wisdom of the community as we understood
it, which suggests that an “event-driven” programming
model can simplify concurrency issues by
reducing opportunities for race conditions and deadlocks
[Ous96]. However, as we gained experience, we
realized that the popular term “event-driven” conflates
several distinct concepts; most importantly, it suggests
that a gain in reasoning about concurrency cannot be
had without cumbersome manual stack management. By
separating these concerns, we were able to realize the
“best of both worlds.”


In Section 2, we define the two distinct concepts whose 
conflation is problematic, and we touch on three related
concepts to avoid confusing them with the central ideas. 
The key concept is that one can choose the reasoning 
benefits of cooperative task management without sacri- 
ficing the readability and maintainability of automatic 
stack management. Section 3 focuses on the topic of 
stack management, describing how software evolution 
exacerbates problems both for code using manual stack 
management as well as code using automatic stack man- 
agement. We show how the most insidious problem with 
automatic stack management can be alleviated. Sec- 
tion 4 presents our hybrid stack-management model that 
allows code using automatic stack management to co- 
exist and interoperate in the same program with code 
using manual stack management; this model helped us 
find peace within a group of developers that disagreed 
on which method to use. Section 5 discusses our experience
in implementing these ideas in two different systems.
Section 6 relates our observations to other work
and Section 7 summarizes our conclusions.


2 Definitions 


In this section, we define and describe five distinct
concepts: task management, stack management, I/O
response management, conflict management, and data
partitioning. These concepts are not completely orthog- 
onal, but considering them independently helps us un- 
derstand how they interact in a complete design. We 
tease apart these concerns and then return to look at how 
the concepts have been popularly conflated. 


2.1 Task management 


One can often divide the work a program does into conceptually
separate tasks: each task encapsulates a control
flow, and all of the tasks access some common,
shared state. High-performance programs are often writ- 
ten with preemptive task management, wherein execu- 
tion of tasks can interleave on uniprocessors or overlap 
on multiprocessors. The opposite approach, serial task 
management, runs each task to completion before start- 
ing the next task. Its advantage is that there is no conflict
of access to the shared state; one can define inter-task 
invariants on the shared state and be assured that, while 
the present task is running, no other tasks can violate 
the invariants. The strategy is inappropriate, however, 
when one wishes to exploit multiprocessor parallelism, 
or when slow tasks must not defer later tasks for a long 
time. 


A compromise approach is cooperative task manage- 
ment. In this approach, a task’s code only yields con- 
trol to other tasks at well-defined points in its execution;
usually only when the task must wait for long-running 
I/O. The approach is valuable when tasks must inter- 
leave to avoid waiting on each other’s I/O, but multi- 
processor parallelism is not crucial for good application 
performance. 


Cooperative task management preserves some of the ad- 
vantage of serial task management in that invariants on 
the global state only need be restored when a task explic- 
itly yields, and they can be assumed to be valid when the 
task resumes. Cooperative task management is harder 
than serial in that, if the task has local state that depends 
on the global state before yielding, that state may be invalid
when the task resumes. The same problem appears
in preemptive task management when releasing locks for 
the duration of a slow I/O operation [Bir89]. 


One penalty for adopting cooperative task management 
is that every I/O library function called must be wrapped 
so that instead of blocking, the function initiates the I/O 
and yields control to another task. The wrapper must 
also arrange for its task to become schedulable when the 
I/O completes.


2.2 Stack management


The common approach to achieving cooperative task 
management is to organize a program as a collection 
of event handlers. Say a task involves receiving a net- 
work message, reading a block from disk, and replying 
to the message. The receipt of the message is an event; 
one procedure handles that event and initiates the disk
I/O. The receipt of the disk I/O result is a second event; 
another procedure handles that event and constructs the 
network reply message. The desired task management 
is achieved, in that other tasks may make progress while 
the present task is waiting on the disk I/O. 


We call the approach just described manual stack man- 
agement. As we argue in Section 3.1, the problem is that 
the control flow for a single conceptual task and its task- 
specific state are broken across several language proce- 
dures, effectively discarding language scoping features. 
This problem is subtle because it causes the most trou- 
ble as software evolves. It is important to observe that 
one can choose cooperative task management for its benefits
while exploiting the automatic stack management
afforded by a structured programming language. We de- 
scribe how in Section 3.3. 


Some languages have a built-in facility for transparently
constructing closures; Scheme’s call-with-current-continuation
is an obvious example [HFW84, FHK84].
Such a facility obviates the idea of manual stack
management altogether. This paper focuses on the
stack management problem in conventional systems languages
without elegant closures.


2.3 I/O management


While this paper focuses on the first two axes, we ex- 
plicitly mention three other axes to avoid confusing them 
with the first two. The first concerns the question of syn- 
chronous versus asynchronous I/O management, which
is orthogonal to the axis of task management. An I/O
programming interface is synchronous if the calling task
appears to block at the call site until the I/O completes,
and then resume execution. An asynchronous interface
call appears to return control to the caller immediately.
The calling code may initiate several overlapping asynchronous
operations, then later wait for the results to arrive,
perhaps in arbitrary order. This form of concurrency
is different from task management because I/O operations
can be considered independently from the computation
they overlap, since the I/O does not access the
shared state of the computation. Code obeying any of
the forms of task management can call either type of I/O 
interface. Furthermore, with the right primitives, one 
can build wrappers to make synchronous interfaces out 
of asynchronous ones, and vice versa; we do just that in 
our systems. 
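A minimal sketch of building a synchronous interface on top of an asynchronous one follows. It assumes a toy single-threaded event queue; g_pending, ReadBlockAsync, and ReadBlockSync are hypothetical names, and lambdas are used only to keep the example short. It does not reproduce the wrappers used in the systems described here.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>

// Toy event queue: completions run only when the loop is pumped.
std::deque<std::function<void()>> g_pending;

// Asynchronous interface: returns at once and delivers the result
// later through the callback.
void ReadBlockAsync(int block,
                    std::function<void(std::string)> on_done) {
    g_pending.push_back([block, on_done] {
        on_done("block-" + std::to_string(block));
    });
}

// Synchronous wrapper: appears to block at the call site by pumping
// the event queue until its own completion has run.
std::string ReadBlockSync(int block) {
    bool done = false;
    std::string result;
    ReadBlockAsync(block, [&](std::string data) {
        result = std::move(data);
        done = true;
    });
    while (!done) {
        if (g_pending.empty()) {
            throw std::logic_error("completion was never queued");
        }
        std::function<void()> next = std::move(g_pending.front());
        g_pending.pop_front();
        next();  // may also run completions that belong to other tasks
    }
    return result;
}
```

Note that other tasks' completions can run while ReadBlockSync waits, so in this sketch the call is a yield point in the cooperative sense.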


2.4 Conflict management 


Different task management approaches offer different
granularities of atomicity on shared state. Conflict management
considers how to convert available atomicity to
a meaningful mechanism for avoiding resource conflicts.
In serial task management, for example, an entire task is
an atomic operation on shared state, so no explicit mechanism
is needed to avoid inter-task conflicts on shared
resources. In the limiting case of preemptive task man- 
agement, where other tasks are executing concurrently, 
tasks must ensure that invariants hold on the shared state 
all the time. 


The general solution to this problem is synchronization
primitives, such as locks, semaphores, and monitors. 
Based on small atomic operations supplied by the ma- 
chine or runtime environment, synchronization primi- 
tives let us construct mechanisms that maintain complex 
invariants on shared state that always hold. Synchro- 
nization mechanisms may be pessimistic or optimistic. 
A pessimistic mechanism locks other tasks out of the 
resources it needs to complete a computation. An op- 
timistic primitive computes results speculatively; if the 
computation turns out to conflict with a concurrent task’s 
computation, the mechanism retries, perhaps also falling 
back on a pessimistic mechanism if no forward progress 
is being made. 
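The difference between the two kinds of mechanism can be sketched on a shared counter. LockedCounter and OptimisticCounter are hypothetical names, and the fallback to a pessimistic mechanism mentioned above is omitted for brevity.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Pessimistic: lock other tasks out for the whole update.
struct LockedCounter {
    std::mutex m;
    long value = 0;

    void add(long delta) {
        std::lock_guard<std::mutex> hold(m);
        value += delta;
    }
};

// Optimistic: compute the new value speculatively and retry if
// another task changed the counter in the meantime.
struct OptimisticCounter {
    std::atomic<long> value{0};

    // Returns how many times the update had to be retried.
    int add(long delta) {
        int retries = 0;
        long seen = value.load();
        // On failure, compare_exchange_weak reloads seen with the
        // current value, so the next attempt uses fresh data.
        while (!value.compare_exchange_weak(seen, seen + delta)) {
            ++retries;
        }
        return retries;
    }
};
```

The retry count is returned so that a caller could decide to switch to the locked variant when it grows too large.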


Cooperative (or serial) task management effectively pro- 
vides arbitrarily large atomic operations: all of the code 
executed between two explicit yield points is executed 
atomically. Therefore, it is straightforward to build 
many complex invariants safely. This approach is analo- 
gous to the construction of atomic sequences with interrupt
masking in uniprocessor OS kernels. We discuss in
Section 3.3 how to ensure that code dependent on atom- 
icity stays atomic as software evolves. 


2.5 Data partitioning 


Task management and conflict management work together
to address the problem of potentially-concurrent
access to shared state. By partitioning that state, we can
reduce the number of opportunities for conflict. For example,
task-specific state needs no concurrency considerations
because it has been explicitly partitioned from
shared state. Data is transferred between partitions by
value; one must be careful to handle implicit references
(such as a value that actually indexes an array) thoughtfully.


Figure 1: Two axes that are frequently conflated.


Explicitly introducing data partitions to reduce the de- 
gree of sharing of shared state can make it easier to write 
and reason about invariants on each partition; data par- 
titioning is an orthogonal approach to those mentioned 
previously. 


2.6 How the concepts relate 


We have described five distinct concepts. They are not 
all precisely orthogonal, but it is useful to consider the
effects of choices in each dimension separately. Most 
importantly, for the purposes of this paper, the task man- 
agement and stack management axes are indeed orthog- 
onal (see Figure 1). 


The idea behind Figure 1 is that conventional concurrent
programming uses preemptive task management and ex- 
ploits the automatic stack management of a standard lan- 
guage. We often hear this point in the space referred to 
by the term “threaded programming.” The second in- 
teresting point in the space is “event-driven program- 
ming,” where cooperative tasks are organized as event 
handlers that yield control by returning control to the 
event scheduler, manually unrolling their stacks. This 
paper is organized around the observation that one can 
choose cooperative task management while preserving 
the automatic stack management that makes a programming
language “structured;” in the diagram, this point is
labeled the “sweet spot.”


3 Stack management 


Given our diagram one might ask, “what are the pros and
cons of the two forms of stack management?” We address
that question here. We present the principal advantages 
and disadvantages of each form, emphasizing how soft- 
ware evolution exacerbates the disadvantages of each. 
We also present a technique that mitigates the principal 
disadvantage of automatic stack management. 


3.1 Automatic versus manual 


Programmers can express a task employing either auto- 
matic stack management or manual stack management. 
With automatic stack management, the programmer ex- 
presses each complete task as a single procedure in the 
source language. Such a procedure may call functions 
that block on I/O operations such as disk or remote re- 
quests. While the task is waiting on a blocking opera- 
tion, its current state is kept in data stored on the pro- 
cedure’s program stack. This style of control flow is 
one meaning often associated with the term “procedure- 
oriented.” 


In contrast, manual stack management requires a pro- 
grammer to rip the code for any given task into event 
handlers that run to completion without blocking. Event 
handlers are procedures that can be invoked by an event- 
handling scheduler in response to events, such as the 
initiation of a task or the response from a previously-
requested I/O. To initiate an I/O, an event handler E1
schedules a request for the operation but does not wait
for the reply. Instead, E1 registers a task-specific object
called a continuation [FHK84] with the event-handling
scheduler. The continuation bundles state indicating
where E1 left off working on the task, plus a reference
to a different event-handler procedure E2 that encodes
what should be done when the requested I/O has completed.
After initiating the I/O and registering the
continuation, E1 returns control to the event-handling
scheduler. When the event representing the I/O completion
occurs, the event-handling scheduler calls E2,
passing E1’s bundled state as an argument. This style
of control flow is often associated with the term “event- 
driven.” 


To illustrate these two stack-management styles, consider
the code for a function, GetCAInfo, that looks
in an in-memory hash table for a specified certificate- 
authority id and returns a pointer to the corresponding 
object. A certificate authority is an entity that issues cer- 
tificates, for example for users of a file system. 


CAInfo *GetCAInfo(CAID caId) {
  CAInfo *caInfo = LookupHashTable(caId);
  return caInfo;
}


Suppose that initially this function was designed to han- 
dle a few globally known certificate authorities and 
hence all the CA records could be stored in memory. We 
refer to such a function as a compute-only function: be- 
cause it does not pause for I/O, we need not consider 
how its stack is managed across an I/O call, and thus the 
automatic stack management supplied by the compiler is 
always appropriate. 


Now suppose the function evolves to support an abun- 
dance of CA objects. We may wish to convert the hash 
table into an on-disk structure, with an in-memory cache 
of the entries in use. GetCAInfo has become a func- 
tion that may have to yield for I/O. How the code evolves 
depends on whether it uses automatic or manual stack 
management. 


Following is code with automatic stack management that 
implements the revised function: 


CAInfo *GetCAInfoBlocking(CAID caId) {
  CAInfo *caInfo = LookupHashTable(caId);
  if (caInfo != NULL) {
    // Found node in the hash table
    return caInfo;
  }
  caInfo = new CAInfo();
  // DiskRead blocks waiting for
  // the disk I/O to complete.
  DiskRead(caId, caInfo);
  InsertHashTable(caId, caInfo);
  return caInfo;
}


To achieve the same goal using manual stack 
management, we rip the single conceptual func- 
tion GetCAInfoBlocking into two source-language 
functions, so that the second function can be called from 
the event-handler scheduler to continue after the disk 
I/O has completed. Here is the continuation object that 
stores the bundled state and function pointer: 


class Continuation {
  // The function called when this
  // continuation is scheduled to run.
  void (*function)(Continuation *cont);
  // Return value set by the I/O operation.
  // To be passed to continuation.
  void *returnValue;
  // Bundled up state
  void *arg1, *arg2, ...;
};


Here is the original function, ripped into the two parts
that function as event handlers: 


void GetCAInfoHandler1(CAID caId,
        Continuation *callerCont)
{
    // Return the result immediately if in cache
    CAInfo *caInfo = LookupHashTable(caId);
    if (caInfo != NULL) {
        // Call caller’s continuation with result
        (*callerCont->function)(caInfo);
        return;
    }
    // Make buffer space for disk read
    caInfo = new CAInfo();
    // Save return address & live variables
    Continuation *cont = new
        Continuation(&GetCAInfoHandler2,
            caId, caInfo, callerCont);
    // Send request
    EventHandle eh =
        InitAsyncDiskRead(caId, caInfo);
    // Schedule event handler to run on reply
    // by registering continuation
    RegisterContinuation(eh, cont);
}


void GetCAInfoHandler2(Continuation
        *cont) {
    // Recover live variables
    CAID caId = (CAID) cont->arg1;
    CAInfo *caInfo = (CAInfo*) cont->arg2;
    Continuation *callerCont =
        (Continuation*) cont->arg3;
    // Stash CAInfo object in hash
    InsertHashTable(caId, caInfo);
    // Now “return” results to original caller
    (*callerCont->function)(callerCont);
}

Note that the signature of GetCAInfo is different from
that of GetCAInfoHandler1. Since the desired result
from what used to be GetCAInfo will not be available
until GetCAInfoHandler2 runs sometime later,
the caller of GetCAInfoHandler1 must pass in a
continuation that GetCAInfoHandler2 can later invoke
in order to return the desired result via the continuation
record. That is, with manual stack management, a
statement that returns control (and perhaps a value) to a
caller must be simulated by a function call to a continuation
procedure.


3.2 Stack Ripping 


In conventional systems languages, such as C++, which 
have no support for closures, the programmer has to do a
substantial amount of manual stack management to yield 
for I/O operations. Note that the function in the previous
section was ripped into two parts because of one I/O
call. If there are more I/O calls, there are even more rips 
in the code. The situation gets worse still with the pres- 
ence of control structures such as for loops. The pro- 
grammer deconstructs the language stack, reconstructs 
it on the heap, and reduces the readability of the code in 
the process. 
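
To make the cost of loops concrete, the following is a minimal sketch of a
three-line blocking loop after it has been ripped by hand. It is not code
from our system: the event queue gEvents, the helper asyncRead, and the
SumState structure are hypothetical names introduced only for this
illustration, and the "I/O" is simulated by queueing a completion event.

```cpp
#include <cassert>
#include <deque>
#include <functional>

// Hypothetical event queue standing in for the event-handler scheduler.
static std::deque<std::function<void()>> gEvents;

// Hypothetical async read: "completes" later by queueing the continuation.
void asyncRead(int block, std::function<void(int)> onDone) {
    gEvents.push_back([block, onDone] { onDone(block * 10); });
}

// State that would have lived on the stack in the blocking version:
//   int sum = 0; for (i = 0; i < n; i++) sum += read(i); return sum;
struct SumState {
    int i;
    int n;
    int sum;
    std::function<void(int)> callerCont;
};

void sumLoopTop(SumState *s);

// Second rip: the rest of the loop body, run when the I/O reply arrives.
void sumAfterRead(SumState *s, int value) {
    s->sum += value;
    s->i++;
    sumLoopTop(s);  // jump back to the loop test
}

// First rip: the loop test and the I/O request.
void sumLoopTop(SumState *s) {
    if (s->i < s->n) {
        asyncRead(s->i, [s](int v) { sumAfterRead(s, v); });
        return;  // unroll the stack back to the scheduler
    }
    std::function<void(int)> cont = s->callerCont;
    int result = s->sum;
    delete s;
    cont(result);  // "return" by calling the caller's continuation
}

// Starts the ripped task and runs the event loop until it is empty.
int runSumDemo(int n) {
    int result = -1;
    sumLoopTop(new SumState{0, n, 0, [&result](int r) { result = r; }});
    while (!gEvents.empty()) {
        std::function<void()> ev = gEvents.front();
        gEvents.pop_front();
        ev();
    }
    return result;
}
```

Even in this small case the loop counter, the accumulator, and the return
path all move into a heap structure, and the loop test becomes its own
function so that a continuation can re-enter it.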


Furthermore, debugging is impaired because when the
debugger stops in GetCAInfoHandler2, the call 
stack only shows the state of the current event han- 
dler and provides no information about the sequence of 
events that the ripped task performed before arriving at 
the current event handler invocation. Theoretically, one 
can manually recover the call stack by tracing through 
the continuation objects; in practice we have observed 
that programmers hand-optimize away tail calls, so that 
much of the stack goes missing. 


In summary, for each routine that is ripped, the programmer
will have to manually manage procedural language
features that are normally handled by a compiler: 


function scoping Now two or more language functions 
represent a single conceptual function. 


automatic variables Variables once allocated on the
stack by the language must be moved into a new
state structure stored on the heap to survive across
yield points.


control structures The entry point to every basic block
containing a function that might block must be
reachable from a continuation, and hence must be
a separate language-level function. That is, conceptual
functions with loops must be ripped into
more than two pieces.


debugging stack The call stack must be manually recovered
when debugging, and manual optimization
of tail calls may make it unrecoverable.


Software evolution substantially magnifies the problem 
of function ripping: when a function evolves from being
compute-only to potentially yielding, all functions,
along every path from the function whose concurrency 
semantics have changed to the root of the call graph may 
potentially have to be ripped in two. (More precisely, all 
functions up a branch of the call graph will have to be 
ripped until a function is encountered that already makes 
its call in continuation-passing form.) We call this phe- 
nomenon “stack ripping” and see it as the primary draw- 
back to manual stack management. Note that, as with 
all global evolutions, functions on the call graph may be 
maintained by different parties, making the change dif- 
ficult. 


3.3 Hidden concurrency assumptions


The huge advantage of manual stack management is
that every yield point is explicitly visible in the code
at every level of the call graph. In contrast, the call 
to DiskRead in GetCAInfo hides potential concur- 
rency. Local state extracted from shared state before 
the DiskRead call may need to be reevaluated after 
the call. Absent a comment, the programmer cannot tell 
which function calls may yield and which local state to 
revalidate as a consequence thereof. 


As with manual stack management, software evolution 
makes the situation even worse. A call that did not yield 
yesterday may be changed tomorrow to yield for I/O. 
However, when a function with manual stack manage- 
ment evolves to yield for I/O, its signature changes to 
reflect the new structure, and the compiler will call atten- 
tion to any callers of the function unaware of the evolu- 
tion. With automatic stack management, such a change 
is syntactically invisible and yet it affects the semantics 
of every function that calls the evolved function, either 
directly or transitively. 


The dangerous aspect of automatic stack management is 
that a semantic property (yielding) of a called procedure
dramatically affects how the calling procedure should be 
written, but there is no check that the calling procedure 
is honoring the property. Happily, concurrency assump- 
tions can be declared explicitly and checked statically or 
dynamically. 


A static check would be ideal because it detects violations
at compile time. Functions that yield are tagged
with the yielding property, and each block of a calling
function that assumes that it runs without yielding is
marked atomic. The compiler or a static tool checks 
that functions that call yielding functions are themselves 
marked yielding, and that no calls to yielding functions 
appear inside atomic blocks. In fact, one could rea- 
sonably abuse an exception declaration mechanism to 
achieve this end. 


A dynamic check is less desirable than a static one be- 
cause violations are only found if they occur at runtime. 
It is still useful in that violations cause an immediate 
failure, rather than subtly corrupting system state in a 
way that is difficult to trace back to its cause. We chose
a dynamic check because it was quick and easy to implement.
Each block of code that depends on atomicity
begins with a call to startAtomic() and ends with
a call to endAtomic(). The startAtomic() function
increments a private counter and endAtomic()
decrements it. When any function tries to block on I/O,
yield() asserts that the counter is zero, and dumps
core otherwise.
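
The dynamic check can be sketched in a few lines. This is an illustration
under stated assumptions rather than our actual implementation: the counter
name gAtomicDepth and the function name yieldForIO are hypothetical, and the
sketch throws an exception where the real yield() dumps core, so that the
violation can be observed without terminating the process.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical private counter of currently open atomic blocks.
static int gAtomicDepth = 0;

void startAtomic() { ++gAtomicDepth; }

void endAtomic() {
    assert(gAtomicDepth > 0);  // unmatched endAtomic() is a bug
    --gAtomicDepth;
}

// Stand-in for the yield() called by every wrapped I/O function.
// Assumption: throws instead of dumping core, to keep the sketch testable.
void yieldForIO() {
    if (gAtomicDepth != 0)
        throw std::logic_error("yield inside atomic block");
    // ... a real implementation would switch to the scheduler here ...
}

// Returns true if a yield attempted inside an atomic block is detected.
bool violationIsCaught() {
    bool caught = false;
    startAtomic();
    try {
        yieldForIO();  // violates the enclosing block's assumption
    } catch (const std::logic_error &) {
        caught = true;
    }
    endAtomic();
    return caught;
}
```

Because the counter nests, an atomic block may call helpers that open their
own atomic blocks without disturbing the outer one.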


Note that in evolving code employing automatic stack 
management, we may also have to modify every func- 
tion extending along every path up the call graph from 
a function whose concurrency semantics have changed. 
However, whereas manual stack management implies 
that each affected function must be torn apart into mul- 
tiple pieces, automatic-stack-management code may re- 
quire no changes or far less intrusive changes. If the 
local state of a function does not depend on the yielding 
behavior of a called function, then the calling function 
requires no change. If the calling function’s local state is 
affected, the function must be modified to revalidate its 
state; this surgery is usually local and does not require
substantial code restructuring. 


4 Hybrid approach 


In our project there are passionate advocates for each of 
the two styles of stack management. There is a hybrid 
approach that enables both styles to coexist in the same 
code base, using adaptors to connect between them. This 
hybrid approach also enables a project to be written in 
one style but incorporate legacy code written in the other. 


In the Windows operating system, “threads” are sched- 
uled preemptively and “fibers” are scheduled coopera- 
tively. Our implementation achieves cooperative task 
management by scheduling multiple fibers on a single
thread; at any given time, only one fiber is active.


In our design, a scheduler runs on a special fiber called 
MainFiber and schedules both manual stack manage- 
ment code (event handlers) and automatic stack manage- 
ment code. Code written with automatic stack manage- 
ment, that expects to block for I/O, always runs on a fiber 
other than MainFiber; when it blocks, it always yields 
control back to MainFiber, where the scheduler se- 
lects the next task to schedule. Compute-only functions, 
of course, may run on any fiber, since they may be freely
called from either context. 


Both types of stack management code are scheduled by 
the same scheduler because the Windows fiber package 
only supports the notion of explicitly switching from one 
fiber to another specified fiber; there is no notion of a 
generalized yield operation that invokes a default fiber
scheduler. Implementing a combined scheduler also al- 
lowed us to avoid the problem of having two, potentially 
conflicting, schedulers running in parallel: one for event 
handlers and one for fibers. 


There are other ways in which the two styles of code 
can be made to interact. We aimed for simplicity and to 
preserve our existing code base that uses manual stack 
management. Our solution ensures that code written in 
either style can call a function implemented in the other 
style without being aware that the other stack manage- 
ment discipline even exists. 


To illustrate the hybrid approach, we show an ex- 
ample that includes calls across styles in both di- 
rections. The example involves four functions: 
FetchCert, GetCertData, VerifyCert, and 
GetCAInfo. (GetCAInfo was introduced in Sec- 
tion 3.1). FetchCert fetches a security certificate us- 
ing GetCertData and then calls VerifyCert in or- 
der to confirm its validity. VerifyCert, in turn, calls 
GetCAInfo in order to obtain a CA with which to verify
a certificate. Here is how the code would look with
serial task management: 


bool FetchCert(User user,
               Certificate *cert) {
    // Get the certificate data from a
    // function that might do I/O
    certificate = GetCertData(user);
    if (!VerifyCert(user, cert)) {
        return false;
    }
    return true;
}


bool VerifyCert(User user,
                Certificate *cert) {
    // Get the Certificate Authority (CA)
    // information and then verify cert
    ca = GetCAInfo(cert);
    if (ca == NULL) return false;
    return CACheckCert(ca, user, cert);
}


Certificate *GetCertData(User user) {
    // Look up certificate in the memory
    // cache and return the answer.
    // Else fetch from disk/network
    if (Lookup(user, cert))
        return certificate;
    certificate = DoIOAndGetCert();
    return certificate;
}

Of course, we want to rewrite the code to use coop- 
erative task management, allowing other tasks to run 
during the I/O pauses, with different functions adher- 
ing to each form of stack management. Suppose that 
VerifyCert is written with automatic stack man- 
agement and the remaining functions (FetchCert, 
GetCertData, GetCAInfo) are implemented with 
manual stack management (using continuations). We 
will define adaptor functions that route control flow be- 
tween the styles. 


4.1 Manual calling automatic 


Figure 2 is a sequence diagram illustrating how code 
with manual stack management calls code with auto- 
matic stack management. In the figure, the details of 
a call in the opposite direction are momentarily ob- 
scured behind dashed boxes. The first event handler 
for FetchCert1 calls the function GetCertData1,
which initiates an I/O operation, and the entire stack 
unrolls in accordance with manual stack management. 
Later, when the I/O reply arrives, the scheduler executes 
the GetCertData2 continuation, which “returns” (by 
a function call) to the second handler for FetchCert. 
This is pure manual stack management. 


When a function written with manual stack management 
calls code with automatic stack management, we must 
reconcile the two styles. The caller code is written ex- 
pecting never to block on I/O; the callee expects to block 
on I/O always. To reconcile these styles, we create a new
fiber and execute the callee code on that fiber. The caller 
resumes (to manually unroll its stack) as soon as the first 
burst of execution on the fiber completes. The fiber may
run and block for I/O several times; when it finishes its
work on behalf of the caller, it executes the caller’s continuation
to resume the caller’s part of the task. Thus,
the caller code does not block and the callee code can
block if it wishes.


Figure 2: GetCertData, code with manual stack management,
calls VerifyCert, a function written with
automatic stack management.


In our example, the manual-stack-management function
FetchCert2 calls through an adaptor to the
automatic-stack-management function VerifyCert.
FetchCert2 passes along a continuation pointing at
FetchCert3 so that it can eventually regain control 
and execute the final part of its implementation. The fol- 
lowing code is for the CFA adaptor, ripped into its call 
and return parts; CFA stands for “Continuation-To-Fiber 
adaptor.” 


void VerifyCertCFA(CertData certData,
        Continuation *callerCont) {
    // Executed on MainFiber
    Continuation *vcaCont = new
        Continuation(VerifyCertCFA2,
            callerCont);
    Fiber *verifyFiber = new
        VerifyCertFiber(certData, vcaCont);
    // On fiber verifyFiber, start executing
    // VerifyCertFiber::FiberStart
    SwitchToFiber(verifyFiber);
    // Control returns here when
    // verifyFiber blocks on I/O
}


void VerifyCertCFA2(Continuation
        *vcaCont) {
    // Executed on MainFiber.
    // Scheduled after verifyFiber is done
    Continuation *callerCont =
        (Continuation*) vcaCont->arg1;
    callerCont->returnValue =
        vcaCont->returnValue;
    // “return” to original caller (FetchCert)
    (*callerCont->function)(callerCont);
}


The first adaptor function accepts the arguments 
of the adapted function and a continuation (“stack 
frame”) for the calling task. It constructs its own 
continuation vcaCont and creates an object called
verifyFiber that represents a new fiber (VerifyCertFiber
is a subclass of the Fiber class); this object
keeps track of the function arguments and vcaCont
so that it can transfer control to VerifyCertCFA2
when verifyFiber’s work is done. Finally, it
performs a fiber-switch to verifyFiber. When
verifyFiber begins, it executes glue routine
VerifyCertFiber::FiberStart to unpack the
parameters and pass them to VerifyCert, which may
block on I/O:


VerifyCertFiber::FiberStart() {
    // Executed on a fiber other than MainFiber
    // The following call could block on I/O.
    // Do the actual verification.
    this->vcaCont->returnValue =
        VerifyCert(this->certData);
    // The verification is complete.
    // Schedule VerifyCertCFA2
    scheduler->schedule(this->vcaCont);
    SwitchTo(MainFiber);
}


This start function simply calls into the func- 
tion VerifyCert. At some point, when 
VerifyCert yields for I/O, it switches control 
back to the MainFiber using a SwitchTo call 
in the I/O function (not the call site shown in the 
FiberStart() routine above). Control resumes 
in VerifyCertCFA, which unrolls the continuation 
stack (i.e., GetCertData2 and FetchCert2)
back to the scheduler. Thus, the hybrid task has 
blocked for the I/O initiated by the code with automatic 
stack management while ensuring that event handler 
FetchCert2 does not block. 


Later, when the I/O completes, verifyFiber is
resumed (for now, we defer the details on how 
this resumption occurs). After VerifyCert has 
performed the last of its work, control returns to 
FiberStart. FiberStart stuffs the return value 
into VerifyCertCFA2's continuation, schedules it to 
execute, and switches back to the MainFiber a final 
time. At this point, verifyFiber is destroyed. When 
VerifyCertCFA2 executes, it “returns” (with a func- 
tion call, as code with manual stack management nor- 
mally does) the return value from VerifyCert back 
to the adaptor-caller’s continuation, FetchCert3. 


4.2 Automatic calling manual 


We now discuss how the code interactions occur when a 
function with automatic stack management calls a func- 
tion that manually manages its stack. In this case, the 
former function needs to block for I/O, but the latter 
function simply schedules the I/O and returns. To reconcile
these requirements, we supply an adaptor that calls
the manual-stack-management code with a special continuation
and relinquishes control to the MainFiber,
thereby causing the adaptor’s caller to remain blocked.
When the I/O completes, the special continuation runs
on the MainFiber and resumes the fiber of the blocked
adaptor, which resumes the original function waiting for
the I/O result.


Figure 3: VerifyCert, code with automatic stack
management, calls GetCAInfo, a function written with
manual stack management.


Figure 3 fills in the missing details of Figure 2 to illus- 
trate this interaction. In this example, VerifyCert 
blocks on I/O when it calls GetCAInfo, a function
with manual stack management. VerifyCert calls
the adaptor GetCAInfoFCA, which hides the manual-stack-management
nature of GetCAInfo (FCA means
Fiber-to-Continuation Adaptor): 


Boolean GetCAInfoFCA(CAID caid) {
    // Executed on verifyFiber
    // Get a continuation that switches control
    // to this fiber when called on MainFiber
    FiberContinuation *cont = new
        FiberContinuation(FiberContinue,
            this);
    GetCAInfo(caid, cont);
    if (!cont->shortCircuit) {
        // GetCAInfo did block.
        SwitchTo(MainFiber);
    }
    return cont->returnValue;
}


void FiberContinue(Continuation *cont) {
    if (!Fiber::OnMainFiber()) {
        // Manual stack mgmt code did not perform
        // I/O: just mark it as short-circuited
        FiberContinuation *fcont =
            (FiberContinuation *) cont;
        fcont->shortCircuit = true;
    } else {
        // Resumed after I/O: simply switch
        // control to the original fiber
        Fiber *f = (Fiber *) cont->arg1;
        f->Resume();
    }
}


The adaptor, GetCAInfoFCA, sets up a special con- 
tinuation that will later resume verifyFiber via the 
code in FiberContinue. It then passes this continuation
to GetCAInfo, which initiates an I/O operation and
returns immediately to what it believes to be the event-handling
scheduler; of course, in this case, control
returns to GetCAInfoFCA. Since I/O was scheduled
and short-circuiting did not occur (discussed later
in this section), GetCAInfoFCA must ensure that control
does not yet return to VerifyCert; to achieve this
effect, it switches control to the MainFiber.


On the MainFiber, the continuation code that
started this burst of fiber execution, VerifyCertCFA,
returns several times to unroll its stack and the scheduler
runs again. Eventually, the I/O result arrives and the
scheduler executes GetCAInfo2, the remaining work
of GetCAInfo. GetCAInfo2 fills the local hash table
(recall its implementation from Section 3.1) and “returns”
control by calling a continuation. In this case,
it calls the continuation (FiberContinue) that had
been passed to GetCAInfo.


FiberContinue notices that verifyFiber has 
indeed been blocked and switches control back to 
that fiber, where the bottom half of the adaptor, 
GetCAInfoFCA, extracts the return value and passes it 
up to the automatic-stack-management code that called 
it (VerifyCert).


The short circuit branch not followed in the example 
handles the case where GetCAInfo returns a result 
immediately without waiting for I/O. When it can do 
so, it must not allow control to pass to the scheduler. 
This is necessary so that a caller can optionally deter- 
mine whether or not a routine has yielded control and 
hence whether or not local state must be revalidated. 
Without a short circuit path, this important optimiza- 
tion and an associated design pattern that we describe 
in Section 5 cannot be achieved. Figure 4 illustrates
the short-circuit sequence: The short-circuit code detects
the case where GetCAInfo runs locally, performs
no I/O, and executes (“returns to”) the current contin- 
uation immediately. FiberContinue detects that it 
was not executed directly by the scheduler, and sets 
the shortCircuit flag to prevent the adaptor from 
switching to the MainFiber. 


4.3 Discussion


An important observation is that, with adaptors in place,
each style of code is unaware of the other. A function
written with automatic stack management sees what it 
expects: deep in its stack, control may transfer away, 
and return later with the stack intact. Likewise, the 
event-handler scheduler cannot tell that it is calling any- 
thing other than just a series of ordinary manual-stack- 
management continuations: the adaptors deftly swap the 
fiber stacks around while looking like any other continuation.
Thus, integrating code in the two styles is straightforward:
fiber execution looks like a continuation to the
event-driven code, and the continuation scheduler looks
like any other fiber to the procedure-oriented code. This
adaptability enables automatic-stack-management programmers
to work with manual-stack-management programmers,
and to evolve a manual-stack-management
code base with automatic-stack-management functions
and vice versa.


Figure 4: A variation where GetCAInfo does not need
to perform I/O.


5 Implementation Experience 


We have employed cooperative task management in 
two systems: Farsite [BDET00], a distributed, secure,
serverless file system running over desktops, and
UCoM [SBS02], a wireless phone application for hand- 
held devices such as PDAs. The Farsite code is designed 
to run as a daemon process servicing file requests on the 
Windows NT operating system. The UCoM system, de- 
signed for the Windows CE operating system, is a client 
application that runs with UI and audio support.


The Farsite system code was initially written in event- 
driven style (cooperative task management and manual 
stack management) to enable simplified reasoning about 
the concurrency conditions of the system. As our code 
base grew and evolved over a period of two years, we 
came to appreciate the costs of employing manual stack 
management and devised the hybrid approach discussed 
in the previous section to introduce automatic stack man- 
agement code into our system. The UCoM system uses 
automatic stack management exclusively. 


Farsite uses fibers, the cooperative threading facility 
available in Windows NT. With Windows fibers, each 
task’s state is represented with a stack, and control is 
transferred by simply swapping one stack pointer for
another, as with setjmp and longjmp. Since fibers
are unavailable in the Windows CE operating system, 
UCoM uses preemptive threads and condition variables 
to achieve a cooperative threading facility: each thread 
blocks on its condition variable and the scheduler en- 
sures that at most one condition variable is signalled at 
any moment. When a thread yields, it blocks on its con- 
dition variable and signals the scheduler to continue; the 
scheduler selects a ready thread and signals its condition
variable. 
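
The thread-and-condition-variable scheme can be sketched as follows. This
is a minimal illustration written against the standard C++ threading
library, not the UCoM code: the class name CoopScheduler, the baton variable
current_, and the round-robin policy are all assumptions made for the
example, and the scheduler must be constructed with the same task count
that is later passed to run().

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

const int kScheduler = -1;  // baton value meaning "the scheduler may run"

class CoopScheduler {
public:
    explicit CoopScheduler(int n) : taskCv_(n), done_(n, false) {}

    // Called by task `self`: pass the baton to the scheduler, then block on
    // this task's own condition variable until it is selected again.
    void yield(int self) {
        std::unique_lock<std::mutex> lock(mu_);
        current_ = kScheduler;
        schedCv_.notify_one();
        taskCv_[self].wait(lock, [&] { return current_ == self; });
    }

    // Runs bodies[i] as task i, round-robin, until every task has finished.
    void run(const std::vector<std::function<void(int)>> &bodies) {
        const int n = static_cast<int>(bodies.size());
        std::vector<std::thread> threads;
        for (int i = 0; i < n; ++i) {
            threads.emplace_back([this, i, &bodies] {
                {   // Wait to be scheduled for the first time.
                    std::unique_lock<std::mutex> lock(mu_);
                    taskCv_[i].wait(lock, [&] { return current_ == i; });
                }
                bodies[i](i);
                std::unique_lock<std::mutex> lock(mu_);
                done_[i] = true;
                current_ = kScheduler;
                schedCv_.notify_one();
            });
        }
        std::unique_lock<std::mutex> lock(mu_);
        bool anyLeft = n > 0;
        while (anyLeft) {
            anyLeft = false;
            for (int i = 0; i < n; ++i) {
                if (done_[i]) continue;
                current_ = i;  // at most one task holds the baton
                taskCv_[i].notify_one();
                schedCv_.wait(lock, [&] { return current_ == kScheduler; });
                if (!done_[i]) anyLeft = true;
            }
        }
        lock.unlock();
        for (std::thread &t : threads) t.join();
    }

private:
    std::mutex mu_;
    std::condition_variable schedCv_;
    std::vector<std::condition_variable> taskCv_;
    std::vector<bool> done_;
    int current_ = kScheduler;
};

// Two tasks append to a shared log without locks of their own; the
// interleaving is determined entirely by where they yield.
std::vector<int> runCoopDemo() {
    std::vector<int> log;
    CoopScheduler sched(2);
    auto body = [&](int self) {
        log.push_back(self * 10 + 1);
        sched.yield(self);
        log.push_back(self * 10 + 2);
    };
    std::vector<std::function<void(int)>> bodies = {body, body};
    sched.run(bodies);
    return log;
}
```

Although the tasks are preemptive threads underneath, only the holder of
the baton is ever runnable, so the task bodies see the same atomicity
between yield points that fibers would give them.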


We implemented the hybrid adaptors in each direction 
with a series of mechanically-generated macros. There 
are two groups of macros, one for each direction of adap- 
tation. Within each group, there are variations to ac- 
count for varying numbers of arguments, void or non- 
void return type, and whether the function being called 
is a static function or an object method; multiple macros
are necessary to generate the corresponding variations in 
syntax. Each macro takes as arguments the signature of 
the function being adapted. The macros declare and cre- 
ate appropriate Fiber and Continuation objects. 


Our experience with both systems has been positive and 
our subjective impression is that we have been able to 
preempt many subtle concurrency problems by using co- 
operative task management as the basis for our work. 
Although the task of wrapping I/O functions (see Sec- 
tion 2.1) can be tedious, it can be automated, and we 
found that paying an up-front cost to reduce subtle race 
conditions was a good investment. 


Both systems use extra threads for converting blocking
I/O operations to non-blocking operations and for
scheduling I/O operations, as is done in many other sys- 
tems, such as Flash [PDZ99]. Data partitioning prevents 
synchronization problems between the I/O threads and 
the state shared by cooperatively-managed tasks. 


Cooperative task management avoids the concurrency 
problems of locks only if tasks can complete without 
having to yield control to any other task. To deal with 
tasks that need to perform I/O, we found that we could
often avoid the need for a lock by employing a particular
design pattern. In this pattern, which we call the Pin- 
ning Pattern, I/O operations are used to pin resources in 
memory where they can be manipulated without yield- 
ing. Note that pinning does not connote exclusivity: a 
pinned resource is held in memory (to avoid the need 
to block on I/O to access it), but when other tasks run, 
they are free to manipulate the data structures it con- 
tains. Functions are structured in two phases: a loop that 
repeatedly tries to execute all potentially-yielding operations
until they can all be completed without yielding,
and an atomic block that computes results and writes
them into the shared state. 


An important detail of the design pattern is that there 
may be dependencies among the potentially-yielding operations.
A function may need to compute on the results
of a previously-pinned resource in order to decide which 
resource to pin next; for example, in Farsite this occurs 
when traversing a path in a directory tree. Thus, in the 
fully general version of the design pattern, a check after 
each potentially-yielding operation ascertains whether 
the operation did indeed yield, and if so, restarts the loop 
from the top. Once the entire loop has executed with- 
out interruption, we know that the set of resources we 
have pinned in memory are related in the way we ex- 
pect, because the final pass through the loop executed 
atomically. 
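
The fully general form of the pattern can be sketched as below. The sketch
is illustrative only and makes several assumptions: the directory map gDir,
the pinned set gPinned, the yield counter gYields, and the functions pin and
resolvePath are hypothetical names, and the "I/O" is simulated by counting
how often a pin operation would have had to yield.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-ins: a directory tree (parent -> child), the set of
// entries currently pinned in memory, and a count of simulated yields.
static std::map<std::string, std::string> gDir;
static std::set<std::string> gPinned;
static int gYields = 0;

// Pins `name` in memory. Returns true if doing so yielded for (simulated)
// I/O, meaning other tasks may have run and local state may be stale.
bool pin(const std::string &name) {
    if (gPinned.count(name) != 0) return false;  // already resident
    ++gYields;  // a real system would block on I/O here
    gPinned.insert(name);
    return true;
}

// Walks `depth` links down from `root` using the two-phase structure.
std::string resolvePath(const std::string &root, int depth) {
    for (;;) {
        // Phase 1: try to pin every needed resource without yielding.
        bool yielded = false;
        std::string cur = root;
        for (int i = 0; i <= depth; ++i) {
            if (pin(cur)) {
                yielded = true;  // state may have changed: restart the loop
                break;
            }
            if (i < depth) cur = gDir.at(cur);  // next pin depends on this one
        }
        if (yielded) continue;
        // Phase 2: atomic block; nothing from here to the return may yield.
        return cur;
    }
}

// Builds a two-level tree, resolves it once, and reports the yield count.
int runPinDemo() {
    gDir["/"] = "a";
    gDir["a"] = "b";
    resolvePath("/", 2);
    return gYields;
}
```

Each restart repeats only in-memory work for the resources already pinned,
so the cost of the retries is small compared with the I/O that caused them.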


6 Related Work 


Birrell offers a good overview of the conventional 
threaded programming model with preemptive task
management [Bir89]. Of his reasons for using concur- 
rency (p. 2), cooperative task management can help with 
all but exploiting multiprocessors, a shortcoming we 
mention in Section 2.1. Birrell advises that “you must 
be fastidious about associating each piece of data with 
one (and only one) mutex” (p. 28); consider coopera- 
tive task management as the limiting case of that advice. 
There is the complexity that whenever a task yields it ef- 
fectively releases the global mutex, and must reestablish 
its invariants when it resumes. But even under preemp- 
tive task management, Birrell comments that “you might 
want to unlock the mutex before calling down to a lower 
level abstraction that will block or execute for a long 
time” (p. 12); hence this complexity is not introduced by
the choice of cooperative task management. 


Ousterhout points out the pitfalls of preemptive task 
management, such as subtle race conditions and dead- 
locks [Ous96]. We argue that his “threaded” model con- 
flates preemptive task management with automatic stack 
management, and his “event-driven” model conflates cooperative
task management with manual stack management.
We wish to convince designers that the choices
are orthogonal, that Ousterhout’s arguments are really 
about the task management decision, and that program- 
mers should exploit the ease-of-reasoning benefits of co- 
operative task management while exploiting the features 
of their programming language by using automatic stack 
management.


Other system designers have advocated non-threaded
programming models because they observe that for 
a certain class of high-performance systems, such 
as file servers and web servers, substantial per- 
formance improvements can be obtained by re- 
ducing context switching and carefully implement- 
ing application-specific cache-conscious task schedul- 
ing [HS99, PDZ99, BDM98, MY98]. These factors 
become especially pronounced during high load situa- 
tions, when the number of threads may become so large 
that the system starts to thrash while trying to give each 
thread its fair share of the system’s resources. We ar- 
gue that the context-switching overhead for user-level 
threads (fibers) is in fact quite low; we measured the cost 
of switching in our fiber package to be less than ten times 
the cost of a procedure call. Furthermore, application- 
specific cache-conscious task scheduling should be just 
as achievable with cooperative task management and au- 
tomatic stack management: the scheduler is given pre- 
cisely the same opportunities to schedule as in event- 
driven code; the only difference is whether stack state is 
kept on stacks or in chains of continuations on the heap. 


For the classes of applications we reference here, processing
is often partitioned into stages [WCB01, LP01].
The partitioning of system state into disjoint stages is a 
form of data partitioning, which addresses concurrency 
at the coarse grain. Within each stage, the designer 
of such a system must still choose a form of conflict 
management, task management, and stack management. 
Careful construction of stages avoids I/O calls within a 
stage; in that case, cooperative task management within 
the stage degenerates to serial task management, and no 
distinction arises in stack management. In practice, at 
the inter-stage level, a single task strings through mul- 
tiple stages, and reads as in manual stack management. 
Typically, the stages are monotonic: once a task leaves 
a stage, it never returns. This at least avoids the ripping 
associated with looping control structures. 


Lauer and Needham show two programming models 
to be equivalent up to syntactic substitution [LN79]. 
We describe their models in terms of our axes: their 
procedure-oriented system has preemptive task management,
automatic stack management (“a process typically
has only one goal or task”), monitors for conflict man- 
agement, and one big data partition protected by those 
monitors. Their message-oriented system has manual 
stack management with task state passed around in mes- 
sages, and no conflicts to manage due to many partitions 
of the state so that it is effectively never concurrently 
shared. 


Notably, of the message-oriented system, they say “neither
procedural interfaces nor global naming schemes
are very useful,” that is, the manual stack management 
undermines structural features of the language. Neither 
model uses cooperative task management as we regard 
it, since both models require identically-detailed reason- 
ing about conflict management. Thus their comparison 
is decidedly not between the models we associate with 
multithreaded and event-driven programming. 


7 Conclusions 


In this paper we clarify an ongoing debate about “event-
driven” versus “threaded” programming models by iden-
tifying two separable concerns: task management and
stack management. Thus separated, the paper assumes 
cooperative task management and focuses on issues 
of stack management in that context. Whereas the 
choice of task management strategy is fundamental, the 
choice of stack management can be left to individual 
taste. Unfortunately, the term “event-driven program- 
ming” conflates both cooperative task management and 
manual stack management. This prevents many peo- 
ple from considering using a readable automatic-stack- 
management coding style in conjunction with coopera- 
tive task management. 


Software evolution is an important factor affecting the 
choice of task management strategy. When concurrency 
assumptions evolve it may be necessary to make global, 
abstraction-breaking changes to an application’s imple- 
mentation. Evolving code with manual stack manage- 
ment imposes the cumbersome code restructuring bur- 
den of stack ripping; evolving either style of code in- 
volves revisiting the invariant logic due to changing con- 
currency assumptions and sometimes making localized 
changes to functions in order to revalidate local state. 


Finally, a hybrid model adapts between code with au- 
tomatic and with manual stack management, enabling 
cooperation among disparate programmers and software 
evolution of disparate code bases. 
Abstract


Concurrency management is a basic requirement for inter- 
process communication in any multitasking system. This usu- 
ally takes the form of lock-based or other blocking algorithms. 
In real-time and/or time-sensitive systems, the less-predictable 
timing behavior of lock-based mechanisms and the additional 
task-execution dependency make synchronization undesirable. 
Recent research has provided non-blocking and wait-free algo- 
rithms for interprocess communication, particularly in the do- 
main of single-writer, multiple-reader semantics, but these al- 
gorithms typically incur high costs in terms of computation or 
space complexity, or both. In this paper, we propose a gen- 
eral transformation mechanism that takes advantage of tempo- 
ral characteristics of the system to reduce both time and space 
overheads of current single-writer, multiple-reader algorithms. 
We show a 17-66% execution time reduction along with a 14- 
70% memory space reduction when three wait-free algorithms 
are improved by applying our transformation. We present three 
new algorithms for wait-free, single-writer, multiple-reader 
communication along with detailed performance evaluation of 
nine algorithms under various experimental conditions. 


1 Introduction 


A key benefit provided by operating systems is a task 
or thread abstraction to manage the complexity that rapidly 
evolves even in very small embedded systems. A task/thread 
model mitigates the complexity growth of large monolithic pro- 
grams, and simplifies the sharing of computing resources be- 
tween the disparate functions of the system. However, the tasks 
of a system very rarely work independently of each other, hence 
needing interprocess communication (IPC) between tasks. 


The simplest method of IPC is through global, shared vari-
ables. This is a very low-overhead method of communication,
but has obvious flaws in concurrent accesses by multiple tasks.
Even if we restrict the domain to single-writer semantics, which
is common in embedded systems and sensor networks, data cor-
ruption can occur.


*The work reported in this paper is supported in part by the U.S. Air-
by DARPA administered under AFRL contract F30602-01-02-0527.


To avoid reading corrupted data from a concurrent object, 
critical sections are often used to coordinate accesses from dif- 
ferent tasks. The simplest approach to implementing critical 
sections is to disallow task preemption inside of the critical 
section. This can be done by disabling and enabling interrupts 
in the CPU at the beginning and end of the critical sections, 
respectively. These are privileged operations and require ker- 
nel intervention. The read and write operations must be imple- 
mented in the kernel, or the application must be wholly trusted, 
since any task running with interrupts disabled cannot be pre- 
empted and may, either maliciously or inadvertently, disrupt the 
system. Moreover, disabling interrupts does not suffice to man- 
age concurrency in multiprocessor systems. 


The most common way to implement critical sections 
is to use software locks — typically through mutexes and 
semaphores. A task has to acquire the necessary locks before 
it can access shared objects. If the needed lock is already held 
by another task, the task blocks, and the operating system will 
resume it when the resource becomes available. Using locks 
serializes concurrent tasks that try to access the shared objects 
simultaneously, thus preventing corruption. In a multiprocessor 
environment, this reduces parallelism and decreases the utiliza- 
tion of available resources. 


Locks can also cause more serious problems such as unpre-
dictable blocking times and deadlocks. If a task is blocked
while still holding the lock (e.g., a page fault occurred, or it
is preempted by a higher-priority task), any other tasks waiting
for the lock are unable to make progress until the lock is sub- 
sequently released. In the worst case, the task may fail while 
holding the lock, or block indefinitely due to circular lock de- 
pendencies, causing deadlock and blocking other tasks from 
ever making progress. 


Even with safeguards to avoid deadlock, locks are particu-
larly unattractive in real-time and embedded systems. Due to
blocking and switching to other tasks, using locks can incur 
high and unpredictable execution time overheads, and cause 
many other problems, including priority inversion, convoying 
of tasks, more difficult schedulability analysis, and increased 
susceptibility to faults. In real-time systems, tasks are usu- 
ally assigned fixed or deadline-based priorities, according to 
which they are scheduled. Priority inversion can occur when 
a high-priority task is blocked waiting for a lock, but the lock 
holder does not make progress due to its low priority. This is 
such a serious issue that many algorithms have been developed 
to limit the effects of priority inversion, including the priority 
inheritance protocol, the priority ceiling protocol, and the im-
mediate priority ceiling protocol [3, 28, 29]. Furthermore, pro-
viding real-time execution guarantees becomes more difficult.
The simple, classical real-time analysis techniques [21] assume 
independently-executing tasks, which is clearly violated when 
locks are used. More complex analysis [29] may be used to pro- 
vide real-time guarantees by accounting for worst-case block- 
ing times, but this may result in poorer utilization of system 
resources. 


Due to the above problems associated with lock-based syn-
chronization IPC approaches, several algorithms that perform
non-blocking and wait-free¹ communication with single-writer,
multiple-reader semantics have been proposed. These allow
tasks to independently access the shared message area without
locks and the problems introduced by blocking. These algo-
rithms, however, are not perfect. Although blocking is avoided,
the operations may become quite complex and can incur non-
negligible computational overheads. More importantly, the al-
gorithms all use multiple buffers to avoid corruption, so their
space overhead is high, wasting memory resources that are 
severely limited in small, embedded systems. 


In this paper, we present three new wait-free algorithms. We 
develop a generalized transformation mechanism that can im- 
prove existing wait-free algorithms by exploiting the temporal 
characteristics of communicating tasks, significantly reducing 
both space and execution time overheads. For some existing 
algorithms, we show up to 66% reduction in execution time 
and 70% reduction in memory requirements after applying our 
transformation. The transformed algorithms preserve all of the 
benefits of wait-free communication along with significant time 
and space savings. 


In the following section, we present some background infor-
mation and further motivate this work. We present our trans-
formation mechanism in Section 3, and illustrate it using some
actual IPC algorithms. Detailed evaluations are done in Section
4. We will put our work in the perspective of related work in
Section 5, before concluding in Section 6.


¹ A concurrent object implementation is non-blocking if at least one process
that is accessing the object can complete an operation within a finite number of
steps regardless of failures. Furthermore, it is wait-free if every process that is
accessing the object can complete an operation within a finite number of steps
[13]. Wait-free is a stronger form of non-blocking as it ensures starvation-free
access.


Figure 1: A schematic block diagram of a real-time system.


2 Motivation 


In this paper, we are primarily concerned with communica- 
tion between a single writer and multiple readers. This is a 
very common scenario in embedded systems — ranging from 
as complex as automotive and industrial control systems to as 
simple as the controllers in kitchen appliances. Figure 1 shows 
a typical real-time system. The sensors are used to acquire in- 
formation from the controlled system. A sensor task reads the 
data, performs any preprocessing, and distributes the informa- 
tion to the various control tasks. The control tasks perform 
computations and set the actuators based on this information, 
so it is important that they obtain uncorrupted, most-recently
produced data from the sensor task. 


Traditionally, the writer (i.e., sensor task) must pass the data
to the readers (i.e., control tasks) by means of mailboxes, one
of which is associated with each reader. However, if there is
a large disparity in the execution frequencies of the tasks, es-
pecially if the sensor read rate is higher than the actuator con-
trol output rates, as is common, data messages will queue up
in the mailboxes. The reader will obtain outdated messages,
and will either have to process these or discard them to acquire 
the most current information. Generating multiple copies of 
each message incurs overheads in processor cycles and mem- 
ory space, both of which are scarce resources in an embedded 
system. Therefore, the mailbox approach is neither appropriate 
nor efficient for typical IPC needed in real-time and embedded 
systems. 


State messages are used to alleviate such problems. They
were proposed in the MARS project [16] and implemented in
ERCOS [25]. The state messages approach associates mail-
boxes with the writer instead of the readers, so only the writer
associated with a particular mailbox can write to it. Further-
more, each message is assumed to include all data that needs to
be communicated, so that the single, most current message con-
veys all information. Since data are time-sensitive, a new mes-
sage can simply overwrite the previous one, effectively present-
ing the readers with the most up-to-date information. However,
since the writer and readers can access the writer’s mailbox
concurrently, the readers can potentially read corrupted data if
the writer simultaneously writes new data.


Figure 2: Reader and writer execution timelines. Each x denotes
a write operation performed by the writer.


There are many synchronization-based algorithms [9, 10] 
designed to ensure that reader tasks will always access un- 
corrupted messages. As mentioned earlier, synchronization, 
particularly with locks, can cause many problems of its 
own. Therefore, in this paper, we focus on wait-free, single-
writer, multiple-reader IPC algorithms [7, 8, 17, 24, 31]. How-
ever, these algorithms have higher space overheads than the
synchronization-based algorithms. Even though the worst-case 
time overhead of these algorithms is significantly lower than 
that of the synchronization-based ones, the execution overheads 
can still be significant. Later in this paper, we present a transfor- 
mation mechanism that takes advantage of the real-time prop- 
erties of the communicating tasks to reduce both the time and 
space overheads of this class of algorithms. First, however, we 
present a brief overview of real-time systems and tasks in the 
next section. 


2.1 Attributes of Real-Time Tasks 


Tasks in a typical real-time system are periodically in-
voked/released and executed.² Each task T is associated with
various attributes, including its period $P$, relative deadline $D$,³
and worst-case execution time (WCET) $C$. The task must be
run once each period, and needs to receive enough process-
ing time to complete execution by its relative deadline. The
real-time scheduler uses these attributes to decide when to run
tasks, and can guarantee that all tasks will meet their deadlines
as long as they require no more than their specified WCETs.
From high-level program flow analysis and low-level timing
information, a task’s WCET can be determined statically. Fig-
ure 2 shows the relationship between these values for a typ-
ical scenario with one reader process and one writer process. The
top timeline represents the reader’s period. For simplicity, the
reader’s relative deadline is assumed to be equal to its period in
our discussion and is not shown here. In general, it is less than
or equal to $P_R$, where $P_R$ is the reader’s period. $C$ denotes the
reader’s WCET, and $C_R$ is the time to perform a read opera-
tion. $R_{\max}$ represents the maximum time the reader can take
to perform a read operation. Note in Figure 2 that the read op-
eration is placed at the end of the reader task’s execution. It is
only drawn there to show the relationship between $R_{\max}$, $C_R$,
$P_R$ and $C$ more clearly, but, in general, the read operation can
be anywhere within the reader’s execution time $C$. The bottom
timeline represents 4 writer periods. The writer’s period and
relative deadline are denoted by $P_W$ and $D_W$, respectively.


² Aperiodic tasks can be handled by a periodic server [18], so the periodic
task model is not a limiting assumption.
³ This equals the deadline minus the release time of the task.


2.2 Temporal Concurrency Control 


Since $R_{\max}$ includes the time the reader is preempted by
higher-priority tasks, it determines the maximum time the
writer process may interfere with the reader within the reader’s
period without the reader missing its deadline. $R_{\max}$ is calcu-
lated as follows:


$$R_{\max} = P_R - (C - C_R).$$


Assuming that all deadlines are met, Figure 2 illustrates the 
worst-case scenario in terms of the maximum number of pre- 
emptions of the reader by the writer task. This occurs when 
the first interfering-write happens as late as possible within 
the writer’s period (first vertical dotted line — just before the 
writer’s deadline) and the last interfering-write happens as early 
as possible within the writer’s period (second vertical dotted 
line — just after the writer is released). 


Let $N_{\max}$ denote the maximum number of times the writer
might interfere with the reader process during a read operation.
$N_{\max}$ can be calculated as:


$$N_{\max} = \max\left(2, \left\lceil \frac{R_{\max} - (P_W - D_W)}{P_W} \right\rceil + 1\right).$$


Therefore, if we use an $(N_{\max} + 1)$-deep circular buffer in-
stead of a single message buffer, the writer can post messages
cyclically without ever interfering with the reader process, as-
suming that the real-time constraints are met. This allows the
reader and writer to access the message area independently of
each other without blocking, using only temporal characteris-
tics guaranteed by the real-time scheduling and a sufficiently-
deep circular buffer to manage concurrency. With multiple
readers, we simply choose an $N_{\max}$ value large enough to work
for all readers, i.e., compute it using the task with the largest $R_{\max}$.
Finally, we keep a pointer to the most recently written mes-
sage. This is updated by the writer, and subsequently used by
the readers to retrieve the latest message. This concept was first
introduced in [16] and later implemented in the Non-Blocking
Write (NBW) protocol [17].


This algorithm is very efficient in terms of execution time,
i.e., almost as fast as using global variables with no protection.
The only overhead associated with this algorithm is the cost of
maintaining the pointer for the most recently written message.
Therefore, it is easy to see that it has optimal timing behavior
among wait-free algorithms.
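The two formulas of this section can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the ceiling form of the $N_{\max}$ expression; the function names and the task parameters (times in milliseconds) are hypothetical and chosen only for illustration.

```python
import math


def r_max(p_r, c, c_r):
    """R_max = P_R - (C - C_R): the longest time a read operation may span."""
    return p_r - (c - c_r)


def n_max(r_max_value, p_w, d_w):
    """N_max = max(2, ceil((R_max - (P_W - D_W)) / P_W) + 1)."""
    return max(2, math.ceil((r_max_value - (p_w - d_w)) / p_w) + 1)


# Hypothetical reader: period 100, WCET 30, of which 5 is the read itself.
reader_r_max = r_max(p_r=100, c=30, c_r=5)

# Hypothetical writer: period 20, relative deadline 15.
writer_n_max = n_max(reader_r_max, p_w=20, d_w=15)

# The text uses an (N_max + 1)-deep circular buffer.
buffer_depth = writer_n_max + 1
```

With these assumed values the reader can be interfered with at most 5 times during one read, so a 6-deep circular buffer would be used. A reader whose $R_{\max}$ is shorter than the writer's slack $P_W - D_W$ falls back to the lower bound of 2.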
2.3 Restricting Memory Use


With a deep enough buffer, the above algorithm will always
guarantee that the readers will not acquire corrupted data. How-
ever, when $R_{\max}$ is large or $P_W$ is small, $N_{\max}$ can get quite
large and would require a large buffer space. This is undesir-
able, especially in embedded systems where memory is usually
a scarce resource. 


EMERALDS’s state message algorithm [35] improves upon
the NBW protocol. To limit memory usage, EMERALDS sim-
ply sets a static maximum buffer threshold for the state mes-
sage. The reader tasks are divided into two groups, fast and
slow readers. Tasks that have $N_{\max}$ values less than this max-
imum buffer threshold are classified as fast readers, while the
others are classified as slow readers.
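The threshold rule just described amounts to a simple partition of the reader set. The sketch below illustrates it under stated assumptions: the function name, the reader names and their $N_{\max}$ values are hypothetical, and it is not the EMERALDS implementation.

```python
def classify_readers(n_max_by_reader, max_buffers):
    """Split readers into (fast, slow) using a static buffer threshold.

    A reader is fast when its N_max is less than the threshold, as in the
    text; every other reader is slow.
    """
    fast = sorted(r for r, n in n_max_by_reader.items() if n < max_buffers)
    slow = sorted(r for r, n in n_max_by_reader.items() if n >= max_buffers)
    return fast, slow


# Hypothetical N_max values for three reader tasks.
example = {"brake_ctl": 3, "display": 8, "logger": 40}
fast_readers, slow_readers = classify_readers(example, max_buffers=8)
```

Here only `brake_ctl` is fast. A reader whose $N_{\max}$ equals the threshold is treated as slow, because the rule requires a strictly smaller value.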


The fast readers execute according to the NBW protocol.
Since these readers have small $N_{\max}$ values, they are both
time- and space-efficient. For slow readers, EMERALDS pro-
vides a system call mechanism that (i) disables interrupts, (ii)
copies the message from the shared buffer to the slow reader’s
local space on behalf of the reader, and (iii) re-enables inter-
rupts. The overhead of this system call is quite high; however,
according to the definition of slow readers, this call is invoked 
relatively infrequently, so it was claimed not to greatly impact 
the overall average-case execution time overheads. 


As we will see in Section 4, the amount of overhead due to 
this system call is significant enough to make its average-case 
execution time much higher than the non-blocking algorithms. 
We would like to have the low execution overheads of the NBW 
protocol and the low memory usage achieved by the EMER- 
ALDS implementation, but without resorting to locks, disabled 
interrupts, or other synchronization-based concurrency control 
mechanisms. The following section details how to achieve this 
by transforming existing wait-free IPC mechanisms. 


3 Improving Wait-Free IPC 


In order to gain the benefits of wait-free IPC along with low 
memory usage, and low average- and worst-case execution
times, we first generalize the concept of fast and slow read- 
ers (to reduce the memory requirements) introduced in EMER- 
ALDS. We then devise a transformation mechanism that can be 
applied to existing wait-free algorithms, preserves all of their 
inherent benefits, and simultaneously improves their perfor- 
mance. 


Here, fast readers are defined as those tasks for which tem- 
poral concurrency control suffices to ensure uncorrupted reads 
without excessive memory usage. Slow readers consist of all of 
the other reader tasks, which would require too much memory 
to employ temporal concurrency control alone. The actual di- 
vision of tasks would depend on the requirements of the final 
system, as we will see later. 


We can transform IPC algorithms to use this concept of fast 
and slow readers. The fast readers will basically employ the 
NBW read mechanism, and will require sufficient buffers to 
ensure temporal concurrency control. The slow readers will 
use the existing IPC mechanism, although slight changes may 
be required because of the parallel approach employed by the 
fast readers. The writer requires more significant changes in 
order to interact with both types of readers. The precise nature 
of these changes depends on the actual algorithm transformed. 


In general, we can make some predictions about the resulting 
performance. First, the average-case execution time (ACET) 
will decrease, since the highest-frequency readers will use the 
very efficient NBW mechanism. Worst-case execution time 
(WCET) is also often reduced, since for most algorithms, ex- 
ecution time depends on the number of simultaneous readers 
using the mechanism, which is reduced to only the slow read- 
ers. With the proper division of tasks into fast and slow readers 
(Section 3.4), the transformed algorithm should require much 
less memory on average than the original algorithm, and in the 
worst case, require no more than the original. 


Our transformation mechanism can be illustrated more con- 
cretely by showing how we apply it to some actual algorithms. 
We first apply our transformation to the algorithm proposed by 
Chen et al. in [7]. We then show how to transform the Dou- 
ble Buffer algorithm, which we have developed and present in 
Section 3.2. Chen’s algorithm has a relatively high execution 
time overhead and low space overhead, so we expect our trans- 
formation to primarily improve execution time. In contrast, the 
Double Buffer algorithm has a high space overhead and low 
execution time overhead. We expect this algorithm to bene- 
fit primarily from memory usage reduction after transforma- 
tion. The following subsections detail the improved algorithms, 
which are evaluated in Section 4. 


3.1 Improving Chen’s Algorithm 


Chen et al. [7] proposed a single-writer, multiple-reader 
wait-free algorithm using the Compare-And-Swap (CAS) in- 
struction. This instruction is used to atomically modify the 
states of control variables used to ensure that the writer never 
writes to a buffer currently in use by some readers. The CAS 
instruction is commonly used in non-blocking algorithms to co- 
ordinate accesses to shared buffers and is supported on most 
modern microprocessors. Even if an architecture does not sup- 
port this instruction, it can be synthesized by using other system 
primitives or system support [5]. The instruction CAS(A,B,C) 
is defined to be equivalent to atomically executing “if A equals 
B, then set A to C and return true, else return false.”
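The quoted definition of CAS can be written out directly. The sketch below is a single-threaded model of those semantics only: Python gives no atomicity guarantee here, whereas a real implementation relies on the processor's atomic instruction. The `Cell` class and the `cas` function name are hypothetical.

```python
class Cell:
    """A mutable memory word, standing in for a shared variable A."""

    def __init__(self, value):
        self.value = value


def cas(cell, expected, new):
    """Model of CAS(A, B, C): if A equals B, set A to C and return True."""
    if cell.value == expected:
        cell.value = new
        return True
    return False


word = Cell(5)
first = cas(word, 5, 9)    # A equals B, so A becomes 9
second = cas(word, 5, 1)   # A is now 9, so nothing changes
```

The second call fails and leaves the word untouched, which is the property the algorithms in this section depend on.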


Chen’s algorithm requires (P + 2) message buffers, where
P is the number of reader tasks. There is a global variable,
Latest, that indexes to the most recently written message
buffer. Additionally, each reader has an entry in a usage ar-
ray indicating the buffer it is using. When the reader reads, it
first clears its entry, and then uses CAS to atomically set this to
Latest if it is still cleared. It then reads back the value from
its entry, and can then safely read from the indicated buffer.
The writer has slightly more work to do. It first scans the usage
array and selects a free buffer. It performs the write, updates
Latest, and then must scan and set each reader entry that is
cleared to Latest using CAS. This has been proven to ensure
correct non-blocking IPC behavior in [7].


    int NSReader;              # Number of slow readers
    int NBuffer;               # Number of buffers
    int Latest;                # Index to the latest message
    message Buff[NBuffer];     # Message buffer
    char Reading[NSReader];    # Usage count

    SlowReader_i() {
     1:   Reading[i] = NBuffer;
     2:   ridx = Latest;
     3:   CAS(Reading[i], NBuffer, ridx);
     4:   ridx = Reading[i];
     5:   read Buff[ridx];
    }

    int GetBuff() {
          boolean InUse[NBuffer];
     6:   for (i = 0; i < NBuffer; i++) InUse[i] = false;
     7:   InUse[Latest] = true;
     8:   for (i = 0; i < NSReader; i++) {
     9:       j = Reading[i];
    10:       if (j != NBuffer) InUse[j] = true;
    11:   }
    12:   for (i = ((Latest + 1) mod NBuffer); ;
    13:        i = ((i + 1) mod NBuffer)) {
    14:       if (InUse[i] == false)
    15:           return i;
          }
    }

    Writer() {
    16:   widx = GetBuff();
    17:   write Buff[widx];
    18:   Latest = widx;
    19:   for (i = 0; i < NSReader; i++)
    20:       CAS(Reading[i], NBuffer, widx);
    }


Figure 3: Improved Chen’s Algorithm.


By taking into account the real-time properties of the com-
municating tasks, we can divide the reader set into two sets:
fast and slow reader sets. By separating the reader set, we can
reduce the space requirement from P + 2 to M + max(2, N),
where M is the number of slow readers and N is the number of
buffers needed by the fast readers. Section 3.4 describes how to
compute M and N in order to optimize for space. Because N
is chosen to be less than, or equal to, the number of fast readers
(i.e., N ≤ P − M), the improved algorithm requires no more
buffer space than the original algorithm. In the worst case (i.e.,
all readers are slow readers), the improved algorithm simply de-
generates to the original algorithm. Furthermore, the execution
time overheads will be greatly reduced, since fast readers use
the very efficient NBW mechanism and the writer overhead is
linear in the number of slow readers only, rather than all read-
ers. Therefore, both space and time overheads can be reduced.
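The space comparison in this paragraph is plain arithmetic. The sketch below evaluates both expressions for a hypothetical task set; the function names and the values of P, M and N are assumptions made for illustration.

```python
def buffers_original(num_readers):
    """Chen's algorithm: P + 2 message buffers."""
    return num_readers + 2


def buffers_improved(num_slow, fast_depth):
    """Improved Chen's algorithm: M + max(2, N) message buffers."""
    return num_slow + max(2, fast_depth)


# Hypothetical task set: P = 10 readers, M = 2 slow readers, and the
# 8 fast readers need N = 3 buffers (so N <= P - M holds).
original = buffers_original(10)
improved = buffers_improved(num_slow=2, fast_depth=3)

# Worst case: every reader is slow (M = P, N = 0).
degenerate = buffers_improved(num_slow=10, fast_depth=0)
```

In this assumed configuration the improved scheme needs 5 buffers instead of 12, and the all-slow case gives back exactly the original requirement.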


The Improved Chen’s algorithm is shown in Figure 3.
NSReader is the number of slow readers. NBuffer is the
total number of message buffers. Buff[] is the array of mes-
sage buffers shared between the writer and readers. Latest is
a control variable that indexes this array, indicating the most re-
cently written message buffer. Reading[] is the usage array
associated with the slow readers such that Reading[i] indi-
cates which buffer entry the i-th slow reader is currently reading.


The slow readers operate identically to the readers in Chen’s
algorithm. Just before the i-th slow reader reads from the mes-
sage buffer, Reading[i] is set to a value between 0 and
NBuffer-1 to indicate the index of the buffer it will be read-
ing. The writer will not overwrite this buffer slot as long
as the slow reader is still using it. The slow reader first as-
signs Reading[i]=NBuffer to indicate that it is preparing
to make a read operation. Then, it reads Latest, and attempts
to set Reading[i] to this value atomically using CAS. If the
writer has preempted the reader and completed a buffer write
before this instruction, it would have already set Reading[i]
to the new Latest value, and the reader’s CAS would fail.
In any case, by line 3, Reading[i] would have been atomi-
cally set to a buffer index that the writer will not use. So the
slow reader simply reads the index and can now read from the
indicated buffer safely.


The fast reader (not shown) is the same as in the NBW pro- 
tocol. It relies only on temporal concurrency control, so it just 
reads Latest and uses the indicated buffer. 
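For reference, the NBW-style access pattern that the fast readers rely on can be modelled in a few lines. This is a single-threaded sketch of a plain circular buffer with a latest-message pointer, not the Improved Chen's writer (which selects slots through GetBuff()). The class name `NBWChannel` and its methods are hypothetical.

```python
class NBWChannel:
    """Circular message buffer with a pointer to the newest message."""

    def __init__(self, depth, initial=None):
        self.slots = [initial] * depth
        self.latest = 0

    def write(self, message):
        # Fill the next slot in cyclic order, then publish it. The pointer
        # is updated only after the message is completely written.
        nxt = (self.latest + 1) % len(self.slots)
        self.slots[nxt] = message
        self.latest = nxt

    def read(self):
        # A fast reader just follows the pointer; temporal concurrency
        # control guarantees the slot is not reused while it is read.
        return self.slots[self.latest]


channel = NBWChannel(depth=3)
for msg in ["m1", "m2", "m3", "m4"]:
    channel.write(msg)
newest = channel.read()
```

With a depth of 3, a given slot is rewritten only on every third write, which is the spacing that the buffer-depth calculation is meant to guarantee.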


The Writer() process looks just like the one in Chen’s
algorithm. It calls GetBuff() to determine which buffer slot
is safe to use next. After it writes the next message, it updates
Latest and then modifies each Reading[i] using CAS if
necessary.
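The interaction between a slow reader and the writer can be traced with a small single-threaded model of Figure 3. It is a sketch under stated assumptions: preemption is simulated by calling the writer between lines 2 and 3 of the reader, the CAS helper is not truly atomic, and the initial value of `reading` and the names `SharedState`, `slow_read` and `preempting_write` are hypothetical.

```python
class SharedState:
    def __init__(self, n_slow, n_buffers):
        self.n_buffers = n_buffers
        self.buff = [None] * n_buffers
        self.latest = 0
        # Assumption: every slow reader initially points at buffer 0.
        self.reading = [0] * n_slow


def cas(array, index, expected, new):
    """Single-threaded stand-in for the atomic CAS instruction."""
    if array[index] == expected:
        array[index] = new
        return True
    return False


def get_buff(s):
    in_use = [False] * s.n_buffers
    in_use[s.latest] = True
    for j in s.reading:
        if j != s.n_buffers:
            in_use[j] = True
    i = (s.latest + 1) % s.n_buffers
    while in_use[i]:          # assumes NBuffer leaves at least one free slot
        i = (i + 1) % s.n_buffers
    return i


def write(s, message):
    widx = get_buff(s)                        # line 16
    s.buff[widx] = message                    # line 17
    s.latest = widx                           # line 18
    for i in range(len(s.reading)):           # lines 19-20
        cas(s.reading, i, s.n_buffers, widx)


def slow_read(s, i, preempting_write=None):
    s.reading[i] = s.n_buffers                # line 1: clear the entry
    ridx = s.latest                           # line 2
    if preempting_write is not None:
        write(s, preempting_write)            # writer runs before line 3
    cas(s.reading, i, s.n_buffers, ridx)      # line 3
    ridx = s.reading[i]                       # line 4
    return s.buff[ridx]                       # line 5


state = SharedState(n_slow=1, n_buffers=3)    # P + 2 buffers for P = 1
write(state, "m1")
plain = slow_read(state, 0)
raced = slow_read(state, 0, preempting_write="m2")
```

In the preempted read, the writer's CAS claims the new buffer on the reader's behalf, the reader's own CAS fails, and the reader returns the newer message. The next call to `get_buff` then skips the buffer that the reader still holds.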


The key difference lies in the GetBuff() function, which is
modified to allow temporal concurrency control for fast read-
ers. First, to prevent the writer from interfering with slow read-
ers, GetBuff() picks a buffer, m, such that no slow reader is
using it (i.e., for all i, Reading[i] ≠ m). To protect the fast
readers, as with the NBW protocol, we must ensure that there
are at least (N − 1) writes between two consecutive writes to
any particular buffer, where N is the buffer depth required for
temporal concurrency control (Section 2.2). GetBuff() pre-
vents the writer from interfering with the fast readers by cycli-
cally choosing buffer entries starting from Latest. When
NBuffer is chosen correctly (Section 3.4), even if each slow
reader is using a unique buffer, there will be enough buffers
(i.e., NBuffer − NSReader) left so that the cyclic selec-
tion will ensure sufficient time between two consecutive writes
to the same buffer, satisfying the requirements for temporal
concurrency control. Thus, the writer will not interfere with
either fast or slow readers.


Let us illustrate this using the example shown in Figure 4.
Assume that there are 20 readers, of which 3 are identified as
slow readers. Assume further that relative execution frequen-
cies of the fast readers and the writer are such that they require
a 4-deep buffer to ensure temporal concurrency control. In this
system, therefore, we need 7 message buffers (4 for the fast
readers, and 1 for each of the slow readers), as compared to 22
buffers needed with the original Chen’s algorithm. Figure 4(a)
shows a particular execution state of the task set in which Latest
points to the 4th buffer slot. Since Reading[0] and Read-
ing[2] point to the 4th and 5th buffer slots, the writer knows
these may be in use, and will not use these buffers. Instead,
it will cyclically select and write to the next available slot af-
ter Latest, the 6th buffer. The worst-case scenario occurs
when the last slow reader now makes a read operation. It will
now prevent the writer from using the 6th buffer. Even if the
three slow readers never relinquish their buffers, the writer can
continue to write cyclically to the remaining 4 buffers, with the
repeating access pattern {7, 1, 2, 3, 7, ... }. This ensures that
no buffer is used more frequently than every fourth write, satis-
fying the conditions for the fast readers.


[Figure 4, panels (a) and (b): the Buff[M+N] and Reading[M] arrays
in two execution states. M: number of slow readers; N: number of
buffers for fast readers.]

Figure 4: An example for Improved Chen’s algorithm.
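The worked example can be replayed by running the buffer-selection logic of Figure 3 on its own. The sketch below is a single-threaded model with the slow readers frozen on their buffers. Indices are 0-based internally and reported as 1-based slot numbers to match the text. The function names and the assignment of the 6th slot to Reading[1] are assumptions.

```python
def get_buff(latest, reading, n_buffers):
    """Pick the next slot after `latest` that no slow reader is using."""
    in_use = [False] * n_buffers
    in_use[latest] = True
    for j in reading:
        if j != n_buffers:            # n_buffers marks a cleared entry
            in_use[j] = True
    i = (latest + 1) % n_buffers
    while in_use[i]:                  # assumes a free slot always exists
        i = (i + 1) % n_buffers
    return i


def write_sequence(latest, reading, n_buffers, count):
    """Return the 1-based slots chosen by `count` consecutive writes."""
    slots = []
    for _ in range(count):
        latest = get_buff(latest, reading, n_buffers)
        slots.append(latest + 1)
    return slots


# State of Figure 4(a): Latest is the 4th slot, Reading[0] and Reading[2]
# hold the 4th and 5th slots, and Reading[1] is cleared.
first_choice = get_buff(latest=3, reading=[3, 7, 4], n_buffers=7) + 1

# Worst case: the last slow reader then takes the 6th slot and no slow
# reader ever releases its buffer.
pattern = write_sequence(latest=5, reading=[3, 5, 4], n_buffers=7, count=8)
```

The first write goes to the 6th slot, and the frozen worst case yields the repeating pattern 7, 1, 2, 3, so each remaining slot is rewritten only on every fourth write.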


The biggest drawback of Chen’s algorithm lies in the com-
plexity of the GetBuff() function and the expensive CAS
instruction itself. As shown in Figure 3, there are three loops
inside of this function. The first one loops NBuffer times,
and the second one loops NSReader times. Finally, the last
one can potentially loop NBuffer times again. Furthermore,
the writer has a loop that executes CAS NSReader times. As
the number of slow readers decreases, we expect the Improved
Chen’s algorithm to perform increasingly better than the
original Chen’s algorithm.


3.2 Double Buffer Algorithm 


We have devised a new wait-free IPC mechanism that is less 
computationally complex than Chen’s algorithm. It, however, 
trades off time for space complexity, requiring approximately 
twice the buffer space. Hence, it is called the Double Buffer 
algorithm. 


The basic constructs of the Double Buffer algorithm are
shown in Figure 5, and the algorithm is summarized in Fig-
ure 6. A two-dimensional shared message buffer, Buff[][],
has (P + 1) rows, where P is the number of reader tasks. Each
row has two buffers. Associated with each row i is a usage
count, ReaderCnt[i], representing the number of readers
currently using either buffer in the row, and a flag, Cl[i], in-
dicating which of the two buffers is more current. A variable,
Latest, points to the row containing the most recently written
data. A reader task first reads Latest, and indicates it is us-
ing the row by incrementing the usage count. It then reads the
buffer indicated by the row’s Cl flag, and decrements the row’s
usage count when it finishes reading. Note that the increment
and decrement operate directly on memory variables and must
be atomic. Such atomic operations are commonly available on
modern processors, including the x86 architecture.


[Figure 5 diagram: the arrays Buff[P+1][2], Cl[P+1] and
ReaderCnt[P+1]. P: number of readers.]

Figure 5: Constructs in the Double Buffer algorithm.


The writer is fairly straightforward. It first scans 
ReaderCnt[], and selects a row that is not being used by the 
readers. It then writes to the buffer that was least recently 
written in the selected row (i.e., opposite to the one indicated 
by the row’s Cl flag). We will see why this is necessary shortly. 
Finally, it updates the row’s Cl flag to point to the newly-written 
buffer, and sets Latest to the row that contains this buffer. In 
case each reader is concurrently reading from a unique row, this 
algorithm requires (P + 1) rows for the writer to work correctly, 
where P is the number of readers. As each row has 2 buffers, 
the space required for the message buffer array is 2(P + 1) buffers. 


To see the correctness of the algorithm, let us consider the 
possible interference scenarios. The writer can only interfere 
with the reader when they both choose to use the same row. 
This can only occur in two cases. The first case can occur when 
a reader is interrupted after it has chosen a row (after line 1), 
but before it updates the use count (before line 2). The writer 
then executes, and can potentially choose the same row as the 
reader. The second case occurs when the writer is interrupted 
after it has chosen a row (after line 7). If this row happens to be 
Latest, then the reader can also choose to read from this same 
row. So, it is possible for the readers and the writer to select 
the same row i. However, the reader will read from the buffer 
indicated by Cl[i], while the writer will use the opposite one. 
As the writer updates Cl[i] only after the complete message is 
written, and the reader always increments the use count before 
reading Cl[i], we can guarantee that the writer and readers 
cannot interfere with each other in this algorithm, even if they 
happen to use the same row. 
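
The two interference cases can be replayed in a small single-threaded model. The sketch below is illustrative only: the class name DoubleBuffer and its method names are hypothetical, the steps are split so that the writer can be interleaved between reader steps by hand, and plain Python integers stand in for the atomic increment and decrement that a real implementation requires.

```python
class DoubleBuffer:
    """Single-threaded model of the Double Buffer structures (not real IPC)."""

    def __init__(self, n_readers):
        self.n_rows = n_readers + 1
        self.buff = [[None, None] for _ in range(self.n_rows)]
        self.reader_cnt = [0] * self.n_rows
        self.cl = [0] * self.n_rows       # column with the more current message
        self.latest = 0

    def reader_choose_row(self):          # reader: pick the Latest row
        return self.latest

    def reader_acquire(self, ridx):       # reader: count up, then read the flag
        self.reader_cnt[ridx] += 1        # must be atomic in a real system
        return self.cl[ridx]

    def reader_read(self, ridx, col):
        return self.buff[ridx][col]

    def reader_release(self, ridx):
        self.reader_cnt[ridx] -= 1        # must be atomic in a real system

    def writer_choose_row(self):          # writer: first unused row from Latest
        i = self.latest
        while self.reader_cnt[i % self.n_rows] != 0:
            i += 1
        return i % self.n_rows

    def writer_commit(self, i, message):  # writer: fill opposite buffer, publish
        col = 1 - self.cl[i]
        self.buff[i][col] = message
        self.cl[i] = col                  # flag flips only after the write
        self.latest = i
        return col


db = DoubleBuffer(n_readers=2)
db.writer_commit(db.writer_choose_row(), "m1")

# Case 1: the reader picks a row, then the writer runs before the count changes.
ridx = db.reader_choose_row()
row = db.writer_choose_row()              # same row, its count is still 0
db.writer_commit(row, "m2")
read_col = db.reader_acquire(ridx)
case1 = (row == ridx, db.reader_read(ridx, read_col))
db.reader_release(ridx)

# Case 2: the writer picks the Latest row, then a reader starts using it.
row = db.writer_choose_row()
ridx = db.reader_choose_row()
read_col = db.reader_acquire(ridx)
written_col = db.writer_commit(row, "m3")  # lands in the opposite buffer
case2 = (row == ridx, written_col != read_col, db.reader_read(ridx, read_col))
db.reader_release(ridx)
```

In both replays the reader and the writer share a row, yet the reader sees a complete message ("m2") because the writer only ever fills the buffer opposite to the one the row's flag points at.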


The Double Buffer algorithm is less computationally complex 
than Chen’s algorithms, but has a space requirement twice 
that of the original Chen’s algorithm. In the next section, we 
use our transformation technique to improve the Double Buffer 


int NReader;                # Number of readers 
int NRows = NReader + 1;    # Number of rows in the message buffer 
int Latest;                 # Index to the row with the latest message 
message Buff[NRows][2];     # Message buffer 
int ReaderCnt[NRows];       # Reader count for each row 
boolean Cl[NRows];          # Column with more up-to-date message 

Reader_i() { 
 1:  ridx = Latest; 
 2:  inc ReaderCnt[ridx]; 
 3:  cl = Cl[ridx]; 
 4:  read Buff[ridx][cl]; 
 5:  dec ReaderCnt[ridx]; 
} 

Writer() { 
 6:  for (i = Latest; ; i++) 
 7:      if (ReaderCnt[i mod NRows] == 0) break; 
 8:  cl = not Cl[i]; 
 9:  write Buff[i][cl]; 
10:  Cl[i] = cl; 
11:  Latest = i; 
} 

Figure 6: Double Buffer algorithm. 


algorithm. As we will see in Section 4, the number of buffers 
required by the transformed Double Buffer algorithm is usually 
comparable to, if not less than, the original Chen’s algorithm. 


3.3 Improved Double Buffer Algorithm 


Applying the same techniques used in devising the Improved 
Chen’s algorithm, we now try to improve the Double Buffer 
algorithm. Again, we divide the reader tasks into fast and slow 
readers. The fast readers need a minimum of N buffers to 
ensure temporal concurrency control, while the M slow readers 
use the original Double Buffer scheme. The total message 
buffer requirement is now $2(M + \max(1, \lceil N/2 \rceil))$ buffers, 
which is less than or equal to the original algorithm’s 2(P + 1) 
buffers, assuming correct partitioning of the readers (see 
Section 3.4). As before, the highest-frequency readers now use the 
very low overhead NBW read mechanism, so execution times 
should be improved as well. 


The data structures and algorithm for Improved Double 
Buffer are shown in Figures 7 and 8, respectively. The slow 
readers are unmodified from the original readers. Fast readers 
simply read from the buffer indicated by Latest and the 
corresponding row’s Cl entry. The writer, too, is mostly 
unmodified. To ensure temporal concurrency control for the fast 
readers, the writer should not reuse any particular buffer until 
at least N - 1 subsequent writes have occurred. This is ensured 
by changing the buffer selection loop to search starting at row 
(Latest + 1) mod NRows. The rows are used cyclically, 
and the buffers within a row alternate on subsequent writes, 
so $\lceil N/2 \rceil$ rows suffice to ensure temporal concurrency control 
for the fast readers. Therefore, the improved algorithm needs 
$2(M + \max(1, \lceil N/2 \rceil))$ buffers. 
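
The reuse distance of the cyclic writer can be checked with a short simulation. This is a sketch under a simplifying assumption that is not in the text: each slow reader pins one row for the whole run, so only the remaining rows cycle. The function name min_reuse_gap and its parameters are hypothetical.

```python
import math


def min_reuse_gap(n_fast_buffers, held_rows, n_writes=200):
    """Smallest number of writes between two uses of the same buffer.

    n_fast_buffers: N, the buffers the fast readers need (example input)
    held_rows: set of row indices pinned by slow readers for the whole run
    """
    m = len(held_rows)
    n_rows = m + max(1, math.ceil(n_fast_buffers / 2))
    reader_cnt = [1 if r in held_rows else 0 for r in range(n_rows)]
    cl = [0] * n_rows
    latest = 0
    last_used = {}                     # (row, column) -> index of last write
    gap = None
    for w in range(n_writes):
        # Writer selection loop: start the search at (Latest + 1) mod NRows.
        i = (latest + 1) % n_rows
        while reader_cnt[i] != 0:
            i = (i + 1) % n_rows
        col = 1 - cl[i]                # alternate buffers within the row
        if (i, col) in last_used:
            d = w - last_used[(i, col)]
            gap = d if gap is None else min(gap, d)
        last_used[(i, col)] = w
        cl[i] = col
        latest = i
    return gap


# N = 7 with five pinned rows: 5 + 4 = 9 rows, of which 4 cycle.
gap_with_slow = min_reuse_gap(7, {0, 2, 4, 6, 8})
gap_no_slow = min_reuse_gap(7, set())
gap_minimal = min_reuse_gap(1, set())
```

With N = 7 the four free rows give a reuse gap of 8 writes, so at least 7 other writes separate two uses of any buffer, which covers the N - 1 requirement. Rows that slow readers acquire and release dynamically are not modeled here.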


M: number of slow readers 
N: number of buffers needed for fast readers 


Figure 7: Constructs in the Improved Double Buffer algorithm. 


To illustrate overhead improvements, let us consider a system 
with 20 reader tasks, of which 5 are classified as slow 
readers. Assume further that based on the $N_{Max}$ calculations 
(Section 2.2), the fast readers need 7 buffers to ensure temporal 
isolation from the writer. With Improved Double Buffer, 
we need 18 message buffer slots, while the original needs 42, 
a significant memory reduction. Moreover, the other control 
variables are proportional to the number of rows, so they, too, 
are reduced. With the new algorithm, the slow readers and the 
writer remain virtually unchanged, but the fast readers have less 
computation than the original readers, so the overall execution 
overheads will decrease as well. Generally, as the number of 
fast readers increases, the execution performance increases, but 
this is not necessarily the case for space requirements. In the 
following section, we will determine how to partition a reader 
set into fast and slow readers, optimizing for space. 
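
The slot counts in this example follow directly from the two buffer formulas, 2(P + 1) for the original algorithm and 2(M + max(1, ceil(N/2))) for the improved one. The helper names below are hypothetical and serve only to make the arithmetic explicit.

```python
import math


def double_buffer_slots(p):
    """Original Double Buffer: (P + 1) rows of two buffers each."""
    return 2 * (p + 1)


def improved_double_buffer_slots(m, n):
    """Improved Double Buffer: M slow-reader rows plus ceil(N/2) rows."""
    return 2 * (m + max(1, math.ceil(n / 2)))


original = double_buffer_slots(20)               # 2 * 21
improved = improved_double_buffer_slots(5, 7)    # 2 * (5 + 4)
saving = 1 - improved / original                 # fraction of slots saved
```

For 20 readers with 5 slow readers and N = 7, this gives 42 and 18 slots, a saving of about 57%.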


3.4 Identification of Fast Readers 


We now present a simple algorithm for partitioning the reader 
set into fast and slow readers, optimizing for minimum memory 
usage. The algorithm is shown in Figure 9, and can be used 
with any single-writer, multiple-reader IPC scheme improved 
with our transformation by simply changing a few constants to 
match the algorithm. 


The algorithm initially sets all reader tasks to be slow 
readers. It keeps the tasks sorted by non-decreasing order of 
their $N_{Max}$ values, computed as with the NBW protocol 
(Section 2.2). It tries to move one task at a time from the slow reader 
set to the fast reader set, and recomputes the number of buffers 
needed, (S + F), where S is the requirement for the slow 
readers, and F for the fast readers. By keeping track of the setting 
with lowest memory use so far, after a single pass through all 
of the tasks, we obtain the Splitpoint, which indicates the 
last fast reader. All tasks with lower $N_{Max}$ values are also part 
of the fast reader set. 


This partitioning of the reader set is optimal with respect to 
the number of message buffers. This is easy to show: take 
a partitioning that is space-optimal, and let task i be the fast 
reader with the largest $N_{Max}$ value. Now, all tasks with lower 
$N_{Max}$ values than task i must also be part of the fast reader 
set (otherwise, we can move them to the fast reader set; they 
will not affect the number of buffers needed for the fast 
readers (i.e., largest $N_{Max}$ value), but will reduce the slow reader 
set’s buffer requirements, and the optimality assumption would 
be invalid). Since the above algorithm considers all partitions 
in which all tasks with less than a particular $N_{Max}$ value are 
in the fast reader set, the optimal partition will be found by the 
algorithm. 


int Latest;                 # Index to the row with the latest message 
int NRows;                  # Number of rows in the message buffer 
message Buff[NRows][2];     # Message buffer 
int ReaderCnt[NRows];       # Reader count for each row 
boolean Cl[NRows];          # Column with more up-to-date message 

SlowReader_i() { 
 1:  ridx = Latest; 
 2:  inc ReaderCnt[ridx]; 
 3:  cl = Cl[ridx]; 
 4:  read Buff[ridx][cl]; 
 5:  dec ReaderCnt[ridx]; 
} 

FastReader_i() { 
 6:  ridx = Latest; 
 7:  boolean cl = Cl[ridx]; 
 8:  read Buff[ridx][cl]; 
} 

Writer() { 
 9:  i = (Latest + 1) mod NRows; 
10:  for (; ; i = ((i + 1) mod NRows)) 
11:      if (ReaderCnt[i] == 0) break; 
12:  cl = not Cl[i]; 
13:  write Buff[i][cl]; 
14:  Cl[i] = cl; 
15:  Latest = i; 
} 


Figure 8: Improved Double Buffer algorithm. 


The partitioning algorithm uses certain constants that depend 
on the specific IPC mechanism used. For the initialization, 
Splitpoint is always set to NULL and F always set to 0, 
but MinNumBuff and S are both set to the number of buffers 
needed assuming that all tasks are slow readers. For the 
Improved Chen’s algorithm, this is (P + 2), and for the Improved 
Double Buffer, it is 2(P + 1), where P is the number of tasks. 
Additionally, V is the number of buffers used for each 
additional slow reader, and is set to 1 and 2, for Chen’s and the 
Double Buffer mechanisms, respectively. 


We illustrate the partitioning algorithm using the sample task 
set in Figure 10, which indicates the writer’s period and relative 
deadline, as well as the readers’ periods (relative deadlines) and 
computation times. $R_{Max}$ and $N_{Max}$ values, assuming $C_p$ is 
negligible and the readers’ relative deadlines are equal to their 
periods, are also shown. Assuming the Double Buffer algorithm, 
initially S = MinNumBuff = 16, F = 0, and all readers are 
in the slow reader set. Tasks are moved one at a time according 
to their $N_{Max}$ values, so first, Reader 0 is moved to the 
fast reader set. Now S = 14 and F = 3, so (F + S) is not 
the lowest value seen, and Splitpoint is not changed. We 
continue with Reader 1, resulting in S = 12 and F = 3, so 
S + F < MinNumBuff holds. Splitpoint is updated to 
Reader 1, and MinNumBuff is set to S + F. We repeat this 
with all of the readers, in order. By the end, Splitpoint 
points to Reader 4, and MinNumBuff = 10. So, with the first 
five readers as fast readers, we achieve the minimum number of 
buffers required for this example, a 37.5% reduction from the 
original algorithm. 


Order reader tasks by N_Max from smallest to largest; 

# Note: S0 is the no. of buffers needed if all tasks are slow readers 
S = S0;                 # no. of buffers for slow readers 
F = 0;                  # no. of buffers for fast readers 
MinNumBuff = S0; 
Splitpoint = NULL; 

For each reader task T_R (ordered by N_Max) 
    Move T_R from the slow reader set to the fast reader set; 
    S = V × sizeof(slow reader set); 
    F = T_R's (N_Max + 1); 
    if (S + F < MinNumBuff) 
        Splitpoint = T_R; 
        MinNumBuff = S + F; 


Figure 9: Algorithm to find space-optimal division of fast and slow 
readers and the amount of space required. 
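
The single-pass search of Figure 9 can be sketched in a few lines of Python. Several things here are assumptions rather than part of the original: the function name partition_readers, the choice to pass the slow-reader cost as a function (Figure 9 writes it as V times the slow set size, whereas the S values 16, 14 and 12 quoted in the walk-through correspond to 2(M + 1) for M slow readers), and the $N_{Max}$ values, which are hypothetical numbers chosen only to be consistent with the S and F values quoted in the text, not the data of Figure 10.

```python
def partition_readers(n_max, slow_cost, fast_cost=lambda n: n + 1):
    """Return (fast_readers, min_buffers) for a list of per-reader N_Max.

    slow_cost(m): buffers needed for m slow readers (mechanism-specific)
    fast_cost(n): buffers needed when the largest fast N_Max is n
    """
    # Consider readers in non-decreasing order of N_Max.
    order = sorted(range(len(n_max)), key=lambda r: n_max[r])
    min_buffers = slow_cost(len(order))      # all readers slow
    splitpoint = None
    for moved, reader in enumerate(order, start=1):
        s = slow_cost(len(order) - moved)    # remaining slow readers
        f = fast_cost(n_max[reader])         # largest N_Max among fast so far
        if s + f < min_buffers:
            splitpoint, min_buffers = reader, s + f
    if splitpoint is None:
        return [], min_buffers
    return order[: order.index(splitpoint) + 1], min_buffers


# Hypothetical N_Max values for seven readers (not taken from Figure 10).
example_n_max = [2, 2, 3, 3, 3, 9, 12]


def double_buffer_cost(m):
    """Slow-reader cost implied by the walk-through: 2(M + 1)."""
    return 2 * (m + 1)


fast, total = partition_readers(example_n_max, double_buffer_cost)
reduction = 1 - total / double_buffer_cost(len(example_n_max))
```

With these inputs the search stops improving at Reader 4, giving five fast readers and 10 buffers instead of 16, the same 37.5% reduction as in the walk-through.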

[Table: one row per process (Writer, Reader 0, Reader 1, Reader 2, ...). Surviving entries: Reader 3: 22; Reader 4: 50; Reader 5: 150, 125. The remaining entries are not recoverable.] 


Figure 10: Task set with one writer and seven reader processes. 


3.5 Transformation Mechanism 


We have shown here how two different single-writer, 
multiple-reader wait-free IPC mechanisms can be modified to 
take into account real-time characteristics of tasks to reduce 
both memory and execution time overheads. In general, we 
can apply our transformation to other such IPC algorithms with 
the following steps. 


Step 1. Identify fast and slow readers for a particular system: 
simply apply the algorithm in Section 3.4. This will min- 
imize the number of message buffers needed, while still 
ensuring temporal isolation between the writer and the fast 
readers. 


Step 2. Fine-tune reader sets: we may not always want to 
optimize for space, so we can adjust the partitioning obtained 
in Step 1 if needed. 


Step 3. Convert reader code to slow reader code: Typically, 
there are no modifications needed for slow readers, so this 
is just a renaming step. 


Step 4. Introduce fast reader code: The fast readers are triv- 
ially implemented — they just read the pointer indicating 
the most recently written message buffer, and then read 
from that buffer. 


Step 5. Modify writer code to ensure temporal isolation with 
fast readers: this is the most significant change required. 
Since most algorithms have some code for selecting a 
buffer to write, this step usually only requires modifying 
the selector to ensure that the same buffer is not reused 
within N consecutive writes. Sometimes, this can simply 
be done by using the available buffers in a cyclic fashion, 
and having enough total buffers. 


Applying these steps, we can modify existing wait-free 
single-writer, multiple-reader algorithms to use real-time char- 
acteristics of the tasks and reduce processing and memory 
costs. 


4 Performance Evaluation 


The goal of our transformation mechanism is to reduce 
the time and space overheads when applied to single-writer, 
multiple-reader algorithms. We now evaluate how much im- 
provement we can achieve with the proposed transformation. 
Specifically, we will compare a total of 9 different IPC mecha- 
nisms, including Chen’s, Improved Chen’s, Double Buffer, and 
Improved Double Buffer algorithms. 


We also consider another wait-free, single-writer, multiple- 
reader IPC mechanism, Peterson’s algorithm, as well as our 
transformed version of it. In Peterson’s algorithm [24], the 
reader determines if its read is corrupted, and may have to per- 
form the read up to 3 times. The writer may also have to write a 
message up to (P + 2) times, where P is the number of readers. 
The mechanism has been revised [34] such that readers read a 
message at most 2 times, and the writer writes a message at 
most (P + 1) times to avoid corruption. We only consider the 
revised version here. We derive the Improved Peterson’s algo- 
rithm by applying our general transformation as described in 
Section 3.5. 


For the purpose of comparison, we also evaluate the NBW 
protocol and the EMERALDS variant of this. As discussed 
earlier, NBW is the most efficient algorithm in terms of execution 
time, but may induce high space overheads. The EMERALDS 
IPC mechanism tries to limit memory use at some cost to per- 
formance. Finally, we also include a very efficient implementa- 
tion of synchronization-based IPC, using a lock algorithm that 
relies on the atomic Test-And-Set instruction, to show the trade- 
offs between synchronization-based and synchronization-free 
mechanisms. 


To make fair and comprehensive comparisons between these 
algorithms, we have considered various parameters, trying to 
answer the following questions. 


• How much does the transformation reduce the average-case 
  and worst-case execution times? 

• How much does the transformation reduce the buffer space 
  requirement? 

• Is the transformation applicable in both uniprocessor and 
  multiprocessor environments? How do they differ? 

• Will different message sizes affect the results? 

• Will the size of the reader set affect the results? 


Class   Category   Relative Frequency to the Writer Process   Percentage within Class 
Fast    Fastest    twice as frequent                          15-25% 
Fast    Fast       1-15 times less frequent                   75-85% 
Slow    Slow       15-50 times less frequent                  75-85% 
Slow    Slowest    50-100 times less frequent                 15-25% 


Figure 11: Reader task set distribution. 



We evaluate the algorithms for memory usage and execu- 
tion time overheads, in both average and worst cases, and for 
both uniprocessor and symmetric multiprocessor (SMP) envi- 
ronments. The only exception is for the EMERALDS IPC 
mechanism, which is evaluated only for uniprocessors. Be- 
cause it assumes that operations are atomic if interrupts are dis- 
abled, it will not work correctly with SMP architectures where 
this assumption does not hold. 


4.1 Experiment Setup 


The algorithms we evaluate in this section are implemented 
and executed under EMERALDS OS [35] running on 
a Pentium-III 500 MHz processor. The experiments use a 
synthetic reader task set, which is divided into two sets, fast 
readers and slow readers, where ‘fast’ and ‘slow’ are defined 
relative to the writer’s period. In a real system, there are usually 
tasks that are executed very frequently, and tasks that run 
very infrequently. To model this behavior, we further divide 
the fast and slow reader sets into finer-grained categories, as 
shown in Figure 11. By making approximately 20% of fast and 
slow readers either very fast or very slow, the resulting task set 
represents a realistic range of task periods that may occur in a 
real-time embedded system. A random reader task set is 
generated for each experiment according to the desired division of 
readers into the four categories. 


4.2 Average vs. Worst-case Execution Time 


The average-case (ACET) and worst-case execution times 
(WCET) to perform an IPC read/write operation are both im- 
portant factors in the performance of an IPC algorithm. A low 
ACET would indicate that the algorithm generally incurs low 
computation overheads. However, to provide timeliness guar- 
antees in embedded real-time systems, the scheduler must ac- 
count for the WCET. An algorithm with low ACET but high 
WCET may result in poor system utilization. 


The ACET and WCET of the SMP versions of the eight 
evaluated algorithms are shown on the top and bottom rows, 
respectively, in Figure 12. The SMP versions of the algorithms 
include bus-lock operations to ensure the atomic operation of 
the critical CAS and TAS instructions with multiple processors. 
The message size is 8 bytes, and the task set consists 
of 1 writer and 20 readers, of which a varying fraction are 
fast readers. Specifically, we evaluated these algorithms when 
the reader set contains 20%, 40%, 60% and 80% fast readers. 
The first three pairs of columns on the graphs are for the three 
single-writer, multiple-reader algorithms and their corresponding 
transformed algorithms. We can see significant reductions 
in both the ACET and WCET from comparison of the 
transformed algorithms with the original ones. As the number of fast 
readers in the reader set increases, the reduction in computation 
time for the transformed algorithms gets more pronounced. 
ACETs for Double Buffer and Chen’s algorithms improve by as 
much as 66%, and for Peterson’s algorithm by as much as 38%. 
This trend is shown in Figure 15(a). 


Figure 12: The top and bottom rows show the average-case and worst-case execution times, respectively, of the SMP version of the algorithms, 
to perform an IPC read / write operation with 8-byte message size. 

Figure 13: The top and bottom rows show the average-case and worst-case execution times, respectively, of the uniprocessor version of the 
algorithms, to perform an IPC read / write operation with 8-byte message size. 

Figure 14: These graphs show the space requirements for different algorithms with 8-byte messages (note that the space requirement is 
architecture-independent). 


Although the amount of improvement is a non-decreasing 
function of the percentage of fast readers in the reader set, the 
magnitude of this improvement depends on the particular 
algorithm. In these experiments, all of the transformed algorithms 
perform better than the original versions except for the WCET 
of the Double Buffer algorithm. This can be attributed to the 
fact that the WCET for the Double Buffer algorithm occurs in 
the slow readers. As this time is not affected by the number of 
slow readers in the system, the WCET does not improve. For 
the other algorithms, the WCETs occur in the writers, whose 
overheads are functions of the number of slow readers, and, 
therefore, improve greatly. 


It is interesting to note that even though the ACET of the 
lock-based algorithm is only up to 4 times larger than those of 
the transformed wait-free algorithms, its WCET is much higher, 
by a factor of 4 to 30. This is in fact an underestimate of the 
true overhead of the lock-based mechanism, since we assume 
no blocking time here. In actual systems, unless the system 
employs mechanisms to limit blocking times, the lock-based 
execution time may be unbounded. 


4.3 Uniprocessor vs. SMP 


The correctness of some asynchronous algorithms relies on the 
fact that certain instructions will be executed atomically. For 
example, Chen’s algorithm requires that the CAS instruction 
be performed atomically. For SMP architectures, this requires 
that expensive bus-locking (e.g., by using the LOCK prefix in 
the x86 architecture) be performed to ensure an atomic 
read-modify-write of memory. Under uniprocessor environments, 
however, such measures are generally not needed. In most 
architectures, including x86, these instructions are already 
guaranteed to be atomic with respect to uniprocessor systems 
without incurring any additional overheads. As a result, we can 
reduce the costs of CAS for Chen’s algorithm, atomic inc and 
dec for Double Buffer, and TAS for the lock-based mechanism. 
We now repeat the above experiments, but using code restricted 
to uniprocessor machines. The results, including evaluations of 
the EMERALDS IPC mechanism, are shown in Figure 13. 


As expected, the ACET and WCET of these algorithms are 
lower than their counterparts for SMP. Even in this case, we 
can still save a significant percentage of execution time 
overheads. It is worth noting how close the ACETs of the 
transformed algorithms are to the optimal NBW protocol execution 
time. WCET improvements after transformation are even more 
pronounced than for ACET, except with the Double Buffer 
algorithm. This anomaly is due to the complexity we have 
introduced in the writer to handle both kinds of readers. 
Nonetheless, the WCET of the Double Buffer algorithm is still very 
close to that of NBW. 


We summarize the reduction in ACET as the percentage of 
fast readers changes in Figure 15(b). Compared to the SMP 
results in Figure 15(a), the ACET reduction in uniprocessor 
environments is less pronounced. Nonetheless, our transformation 
still yields a substantial reduction in execution time. 


4.4 Savings in Space 


Thus far, we have shown that our transformation mechanism 
enhances the performance of algorithms in terms of the ACET 
and the WCET in both SMP and uniprocessor environments. 
Here, we present results to support our claim that the 
transformation mechanism not only reduces the time overheads but 
also the space overheads of the algorithms. This is shown in 
Figure 14. Again, we varied the percentage of fast readers in 
the reader set. As expected, the amount of improvement 
increases as the number of fast readers increases. Moreover, the 
synchronization-based algorithm requires the least space, since 
it only needs a single shared message buffer. The NBW protocol 
and lock-based IPC, therefore, represent the extreme cases 
for the tradeoff between space requirements and WCET. The 
non-blocking IPC mechanisms, especially with our transformation, 
provide a good compromise, balancing WCET and memory 
usage. 


Figure 15: Varying the percentage of fast readers in the reader set, (a) and (b) show the percentage of savings in ACET for SMP and 
uniprocessor versions of the algorithms, respectively. (c) shows the percentage of savings in space. 


Interestingly, the percentage of space reduction for all three 
transformed algorithms is the same, as shown in Figure 15(c). 
This does make sense since the memory requirements of the 
three original algorithms are all proportional to the number of 
readers in the reader set. So, the memory used by the trans- 
formed algorithms decreases proportionally to the number of 
slow readers. The slight variations in Figure 15(c) are due to 
some of the control variables that do not scale with the num- 
ber of reader tasks. Overall, we achieve a reduction in memory 
usage that ranges from 14 to 70%. 


4.5 Effects of Message and Reader Set Size 


The experiments in the previous sections all use 8-byte mes- 
sages. To see how varying the message size affects the savings 
in time and space, we have performed the same set of experi- 
ments with larger messages (64 bytes). The measurements fol- 
low a similar trend, but the percentage reduction in execution 
time is less than when using 8-byte messages. This is because 
the execution overhead of the actual message buffer read/write 
operation, which cannot be reduced, becomes a more dominant 
part of the total execution overheads. The percentage reduc- 
tions in space overheads are the same, or slightly better than for 
the 8-byte message case, since the constant overheads of some 
of the control variables are less apparent. Due to the substan- 
tially similar results, the 64-byte message measurements are not 
presented here. 


We have also conducted experiments while varying the total 
size of the reader set. Running the previous experiments with 
10 reader tasks resulted in nearly identical relative performance 
improvements with our transformation mechanism. Of course, 
with fewer readers, any complexity increase in the writer task 
has greater weight in the average execution time, but this is 
offset by the performance gains in the fast readers. Space 
reduction, as before, is essentially linear in the percentage reduction 
in the number of slow readers. Again, due to their substantially 
similar results, the data for the 10-reader case are omitted here. 


5 Related Work 


Some earlier work [17,20] on lock-free objects was done us- 
ing read-and-check loops. The reader is required to check if 
its reading was interfered with by the writer, in which case it 
performs the read operation again until it succeeds. Optimiza- 
tion techniques to reduce the number of loops were proposed 
in [15], using an exponential backoff policy. Kopetz et al. [17] 
and Anderson et al. [2] later demonstrated how to bound the 
number of retries by either increasing the buffer size or through 
judicious scheduling. 


To reduce the time overheads associated with read-and-check 
loops, algorithms that make space and time tradeoffs were later 
proposed [6-8, 17, 24, 31, 35]. These algorithms provide a good 
middle-ground between the purely lock-based approach (high 
WCET) and the purely buffer-based approach (large buffer re- 
quirement). The benefit of these algorithms is that less time 
is wasted in read-and-check loops and the timing behavior is 
more predictable, improving schedulability of task sets as well 
as system utilization. Although the timing behavior is more 
predictable, the computational complexity of these algorithms 
is still high. Moreover, they may still incur a large buffer space 
requirement, and may be difficult to use in small-memory em- 
bedded systems. This difficulty can be overcome by our trans- 
formation mechanism, which makes significant reductions in 
both time and space overheads. 


Most non-blocking algorithms rely on the availability of 
some form of atomic memory update instructions, such 
as Compare-And-Swap or Load-Linked/Store-Conditional in 
hardware. A few modern hardware platforms, however, do 
not implement some of these instructions. The author of [23] 
demonstrated how to emulate these instructions by synthesiz- 
ing more commonly-implemented instructions to close the gap 
between the primitives that the algorithm designers rely upon, 
and the primitives provided by the hardware. Bershad [5] pro- 
posed how to implement CAS instruction in software by using 
operating system support, and Greenwald et al. [12] general- 
ized this technique to implement Double-Word CAS and Multi- 
Word CAS instructions. Similar work was done in Synthesis 
[22] and Cache kernel [12]. Our transformation mechanism 
does not use such operations, so it is not directly affected by 
whether the atomic operations used by the original IPC algo- 
rithms are supported by the hardware or are emulated. How- 
ever, the degree of performance improvement will be different. 
All of the algorithms we evaluated use atomic update instruc- 
tions supported natively by the x86 architecture. We expect an 
even greater improvement with our transformation if these in- 
structions are emulated since the overheads for emulation will 
most likely be higher. 


Herlihy [13] proposed the first general methodology to trans- 
form sequential data objects to the equivalent non-blocking 
structures. Alemany et al. [1] and LaMarca [19] proposed tech- 
niques to reduce the inefficiencies in applying this methodology 
to large objects at the cost of more communication between the 
application process and the operating system. Other methods to 
improve this were proposed in [4,32]. Prakash et al. [26] and 
Turek et al. [32] presented techniques to transform multiple- 
lock concurrent objects into lock-free objects. However, it was 
shown that their transformed algorithms are less efficient than 
the corresponding lock-based algorithms [15,19,30]. These 
authors are concerned with transforming sequential objects to 
non-blocking objects, and the related performance issues. We 
take the next logical step by transforming non-blocking ob- 
jects, in particular, those with single-writer, multiple-reader se- 
mantics, to better-performing and less space-consuming non- 
blocking objects. 


Some interesting work [14, 15,27] has also been done in the 
construction of more complex concurrent objects. Concurrent 
non-blocking array-based stacks, FIFO queues and multiple 
lists were implemented using Double-Compare-And-Swap in 
[12]. Valois introduced non-blocking algorithms for queues, 
linked-lists, and arrays in [33]. Eliot et al. [11] proposed non- 
blocking algorithms for garbage collection. We do not look at 
these complex structures, but focus instead on the more com- 
mon, single-writer, multiple-reader state message construct, 
used for IPC in embedded systems. 


6 Conclusions 


In this paper, we have argued for efficient IPC mechanisms, 
particularly for memory- and processing-power-constrained 
embedded real-time systems. Traditional and synchronization-based 
IPC methods incur too much time overhead and follow 
incorrect semantics for most such systems. Instead, we 
considered wait-free, single-writer, multiple-reader IPC 
algorithms, which are more appropriate for these systems, but still 
can incur substantial overheads. 


By taking advantage of the temporal characteristics of the 
tasks in these systems, we have proposed a general transfor- 
mation mechanism that can significantly reduce both space and 
time overheads of the wait-free IPC algorithms. This allows 
the most frequently-executing reader tasks to use very low- 
overhead operations, while reducing the total number of buffers 
needed to ensure corruption-free message passing. We have 
demonstrated our transformation on the existing Chen’s algo- 
rithm and the new Double Buffer algorithm that we have intro- 
duced here. 


Our extensive experiments show a 17-66% reduction in 
ACET, and a 14-70% reduction in memory requirements for 
the IPC algorithms improved with our transformation. For al- 
gorithms with relatively high WCETs, these are shown to be 
improved greatly as well. The experiments also demonstrate 
the tradeoff between time and space in IPC mechanisms: the 
NBW protocol is time-optimal, but requires large buffers, while 
a lock-based approach requires just a single message buffer, but 
suffers from very high worst-case execution overheads. Over- 
all, the single-writer, multiple-reader non-blocking algorithms 
are good intermediate solutions, balancing WCET and space 
requirements. With our transformation, we can do even better, 
reducing both time and space requirements of these algorithms. 


This transformation mechanism can be applied to other non-
blocking IPC algorithms that are not considered here, and make
them better optimized for systems with real-time characteris- 
tics. In the future, we would like to extend our methodology 
to reduce synchronization overheads in more general IPC al- 
gorithms with multiple-writer semantics and to extend this to 
more general communication channels as well. 


Abstract


A distributed algorithm for determining the positions of 
nodes in an ad-hoc, wireless sensor network is explained
in detail. Details regarding the implementation of such 
an algorithm are also discussed. Experimentation is
performed on networks containing 400 nodes randomly 
placed within a square area, and resulting error mag- 
nitudes are represented as percentages of each node’s 
radio range. In scenarios with 5% errors in distance 
measurements, 5% anchor node population (nodes with 
known locations), and average connectivity levels be-
tween neighbors of 7 nodes, the algorithm is shown to
have errors less than 33% on average. It is also shown 
that, given an average connectivity of at least 12 nodes 
and 10% anchors, the algorithm performs well with up 
to 40% errors in distance measurements. 


1 Introduction 


Ad-hoc wireless sensor networks are being developed 
for use in monitoring a host of environmental charac-
teristics across the area of deployment, such as light,
temperature, sound, and many others. Most of these 
data have the common characteristic that they are use- 
ful only when considered in the context of where the 
data were measured, and so most sensor data will be 
stamped with position information. As these are ad-hoc 
networks, however, acquiring this position data can be 
quite challenging. 


Ad-hoc systems strive to incorporate as few assumptions 
as possible about characteristics such as the composition
of the network, the relative positioning of nodes, and
the environment in which the network operates. This 
calls for robust algorithms that are capable of handling 
the wide set of possible scenarios left open by so many 
degrees of freedom. Specifically, we only assume 
that all the nodes being considered in an instance of 
the positioning problem are within the same connected 
network, and that there will exist within this network 
a minimum of four anchor nodes. Here, a connected 
network is a network in which there is a path between
every pair of nodes, and an anchor node is a node that is
given a priori knowledge of its position with respect to 
some global coordinate system. 


A consequence of the ad-hoc nature of these networks 
is the lack of infrastructure inherent to them. With 
very few exceptions, all nodes are considered equal; 
this makes it difficult to rely on centralized computation 
to solve network wide problems, such as positioning. 
Thus, we consider distributed algorithms that achieve 
robustness through iterative propagation of information 
through a network. 


The positioning algorithm being considered relies on 
measurements, with limited accuracy, of the distances 
between pairs of neighboring nodes; we call these 
range measurements. Several techniques can be used 
to generate these range measurements, including time 
of arrival, angle of arrival, phase measurements, and 
received signal strength. This algorithm is indifferent 
to which method is used, except that different methods 
offer different tradeoffs between accuracy, complexity, 
cost, and power requirements. Some of these methods 
generate range measurements with errors as large as 
±50% of the measurement. Note that these errors
can come from multiple sources, including multipath 
interference, line-of-sight obstruction, and channel in-
homogeneity with regard to direction. This work, how-
ever, is not concerned with the problem of determining
accurate range measurements. Instead, we assume large 
errors in range measurements that should represent an 
agglomeration of multiple sources of error. Being able 
to cope with range measurement errors is the first of two
major challenges in positioning within an ad-hoc space, 
and will be termed the range error problem throughout 
this paper. 


The second major challenge behind ad-hoc position- 
ing algorithms, henceforth referred to as the sparse 
anchor node problem, comes from the need for at 
least four reference points with known location in a 
three-dimensional space in order to uniquely determine 
the location of an unknown object. Too few reference 
points result in ambiguities that lead to underdeter-
mined systems of equations. Recalling the assumptions
made above, only the anchor nodes will have positioning 
information at the start of these algorithms, and we 
assume that these anchor nodes will be located randomly 
throughout an arbitrarily large network. Given limited 
radio ranges, it is therefore highly unlikely that any 
randomly selected node in the network will be in direct 
communication with a sufficient number of reference 
points to derive its own position estimate. 


In response to these two primary obstacles, we present 
an algorithm split into two phases: the start-up phase 
and the refinement phase. The start-up phase addresses 
the sparse anchor node problem by cooperatively spread- 
ing awareness of the anchor nodes’ positions throughout 
the network, allowing all nodes to arrive at initial posi- 
tion estimates. These initial estimates are not expected 
to be very accurate, but are useful as rough approxima- 
tions. The refinement phase of the algorithm then uses 
the results of the start-up algorithm to improve upon 
these initial position estimates. It is here that the range 
error problem is addressed. 


This paper presents our algorithms in detail, and dis- 
cusses several network design guidelines that should be 
taken into consideration when deploying a system with 
such an algorithm. Section 2 will discuss related work 
in this field. Section 3 will elaborate our two-phase 
algorithm approach, exploring in depth the start-up 
and refinement phases of our solution. Section 4 will 
discuss some subtleties of the algorithm in relation to 
our simulation environment. Section 5 reports on the 
experiments performed to characterize the performance 
of our algorithm. Finally, Section 6 is a discussion of de- 
sign guidelines and algorithm limitations, and Section 7 
concludes the paper. 


2 Related work


The recent survey and taxonomy by Hightower and 
Borriello provides a general overview of the state of the 
art in location systems [7]. However, few systems for lo- 
cating sensor nodes in an ad-hoc network are described, 
because of the aforementioned range error and sparse 
anchor node problems. Many systems are based on the 
attractive option of using the RF radio for measuring 
the range between nodes, for example, by observing 
the signal strength. Experience has shown, however, 
that this approach yields very inaccurate distances [8]. 
Much better results are obtained by time-of-flight mea- 
surements, particularly when acoustic and RF signals 
are combined [6, 12]; accuracies of a few percent of 
the transmission range are reported. Acoustic signals, 
however, are temperature dependent and require an un- 
obstructed line of sight. Furthermore, even small errors 
do accumulate when propagating distance information 
over multiple hops. 


A drastic approach that avoids the range error problem 
altogether is to use only connectivity between nodes. 
The GPS-less system by Bulusu et al. [3] employs a grid 
of beacon nodes with known locations; each unknown
node sets its position to the centroid of the locations of
the beacons connected to the unknown. The position
accuracy is about one-third of the separation distance 
between beacons, implying a high beacon density for 
practical purposes. Doherty et al. use the connectivity 
between nodes to formulate a set of geometric con-
straints and solve it using convex optimization [5]. The
resulting accuracy depends on the fraction of anchor 
nodes. For example, with 10% anchors the accuracy 
for unknowns is on the order of the radio range. A 
serious drawback, which is currently being addressed, 
is that convex optimization is performed by a single, 
centralized node. The “DV-hop” approach by Niculescu
and Nath, in contrast, is completely ad-hoc and achieves
an accuracy of about one-third of the radio range for 
networks with dense populations of (highly connected) 
nodes [10]. In a first phase anchors flood their location 
to all nodes in the network. Each unknown node 
records the position and (minimum) number of hops 
to at least three anchors. Whenever an anchor a_1
infers the position of another anchor a_2 it computes
the distance between them, divides that by the number 
of hops, and floods this average hop distance into the 
network. Each unknown uses the average hop distance 
to convert hop counts to distances, and then performs a 
triangulation to three or more distant anchors to estimate 
its own position. “DV-hop” works well in dense and 
regular topologies, but for sparse or irregular networks 
the accuracy degrades to the radio range. 


More accurate positions can be obtained by using the
range measurements between individual nodes (when 
the errors are small). When the fraction of anchor nodes 
is high the “iterative multilateration” method by Sav- 
vides et al. can be used [12]. Nodes that are connected 
to at least three anchors compute their position and 
upgrade to anchor status, allowing additional unknowns 
to compute their position in the next iteration, etc. 
Recently a number of approaches have been proposed 
that require few anchors [4, 9, 10, 11]. They are 
quite similar and operate as follows. A node measures 
the distances to its neighbors and then broadcasts this 
information. This results in each node knowing the 
distance to its neighbors and some distances between 
those neighbors. This allows for the construction of (par- 
tial) local maps with relative positions. Adjacent local 
maps are combined by aligning (mirroring, rotating) the 
coordinate systems. The known positions of the anchor 
nodes are used to obtain maps with absolute positions. 
When three or more anchors are present in the network 
a single absolute map results. This style of locationing 
is not very robust since range errors accumulate when 
combining the maps. 


3 Two-phase positioning 


As mentioned earlier, the two primary obstacles to posi- 
tioning in an ad-hoc network are the sparse anchor node 
problem and the range error problem. In order to address 
each of these problems sufficiently, our algorithm is 
separated into two phases: start-up and refinement. For 
the start-up phase we use Hop-TERRAIN, an in-house 
algorithm similar to DV-hop [10]. The Hop-TERRAIN 
algorithm is run once at the beginning of the positioning 
algorithm to overcome the sparse anchor node prob- 
lem, and the Refinement algorithm is run iteratively 
afterwards to improve upon and refine the position 
estimates generated by Hop-TERRAIN. Note therefore 
that the emphasis for Hop-TERRAIN is not on getting 
highly accurate position estimates, but instead on getting 
very rough estimates so as to have a starting point for 
Refinement. Conversely, Refinement is concerned only 
with nodes that exist within a one-hop neighborhood, 
and it focuses on increasing the accuracy of the position 
estimates as much as possible. 


3.1 Hop-TERRAIN 


Before the positioning algorithm has started, most of 
the nodes in a network have no positioning data, with
the exception of the anchors. The networks being con-
sidered for this algorithm will be scalable to very large
numbers of nodes spread over large areas, relative to the 
short radio ranges that each of the nodes is expected to 
possess. Furthermore, it is expected that the percentage 
of nodes that are anchor nodes will be small. This results 
in a situation in which only a very small percentage
of the nodes in the network are able to establish direct 
contact with any of the anchors, and probably none of 
the nodes in the network will be able to directly contact 
enough anchors to derive a position estimate. 


In order to overcome this initial information deficiency, 
the Hop-TERRAIN algorithm finds the number of hops 
from a node to each of the anchor nodes in a network
and then multiplies this hop count by an average hop 
distance (see Section 4.2) to estimate the range between 
the node and each anchor. These computed ranges 
are then used together with the anchor nodes’ known 
positions to perform a triangulation and get the node’s 
estimated position. The triangulation consists of solving 
a system of linearized equations (Ax=b) by means of a 
least squares algorithm, as in earlier work [11]. 
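

One common way to obtain such a linear system is sketched below for the 2-dimensional case. It is an illustration under stated assumptions and may differ in detail from the formulation in [11]. The symbols (x_i, y_i) for the known position of anchor i, d_i for the estimated range to it, and n for the number of anchors are introduced here for the example only. Subtracting the range equation of the last anchor from each of the others removes the quadratic terms in the unknown position (x, y):

```latex
% Range equations, one per anchor i = 1, ..., n:
(x - x_i)^2 + (y - y_i)^2 = d_i^2

% Subtracting equation n from equation i (i = 1, ..., n-1) cancels x^2 and y^2:
2 (x_i - x_n)\, x + 2 (y_i - y_n)\, y
    = x_i^2 - x_n^2 + y_i^2 - y_n^2 + d_n^2 - d_i^2

% In matrix form Ax = b, with one row per i:
A = \begin{pmatrix}
      2(x_1 - x_n)     & 2(y_1 - y_n) \\
      \vdots           & \vdots       \\
      2(x_{n-1} - x_n) & 2(y_{n-1} - y_n)
    \end{pmatrix},
\qquad
b_i = x_i^2 - x_n^2 + y_i^2 - y_n^2 + d_n^2 - d_i^2

% Least squares estimate of the position:
\hat{x} = (A^{T} A)^{-1} A^{T} b
```

With n = 3 anchors the system has two equations in two unknowns; additional anchors make it over-determined, which is what lets the least squares solution average out range errors.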


Each of the anchor nodes launches the Hop-TERRAIN 
algorithm by initiating a broadcast containing its known 
location and a hop count of 0. All of the one-hop 
neighbors surrounding an anchor hear this broadcast, 
record the anchor’s position and a hop count of 1, and 
then perform another broadcast containing the anchor’s 
position and a hop count of 1. Every node that hears 
this broadcast and did not hear the previous broadcasts 
will record the anchor’s position and a hop count of 
2 and then rebroadcast. This process continues until 
each anchor’s position and an associated hop count value 
have been spread to every node in the network. It is 
important that nodes receiving these broadcasts search 
for the smallest number of hops to each anchor. This 
ensures conformity with the model used to estimate the 
average distance of a hop, and it also greatly reduces 
network traffic. 


As broadcasts may be omni-directional, and may there- 
fore reach nodes behind the broadcasting node (rela- 
tive to the direction of the flow of information), this 
algorithm causes nodes to hear many more packets 
than necessary. In order to prevent an infinite loop of 
broadcasts, nodes are allowed to broadcast information 
only if it is not stale to them. In this context, information 
is stale if it refers to an anchor that the node has already 
heard from and if the hop count included in the arriving 
packet is greater than or equal to the hop count stored in 
memory for this particular anchor. New information will 
always trigger a broadcast, whereas stale information 
will never trigger a broadcast.


Once a node has received an average hop distance and
data regarding at least 3(4) anchor nodes for a network 
existing in a 2(3)-dimensional space, it is able to perform 
a triangulation to estimate its location. If this node 
subsequently receives new data after already having 
performed a triangulation, either a smaller hop count 
or a new anchor, the node simply performs another
triangulation to include the new data. This procedure
is summarized in the following piece of pseudo code: 


when a positioning packet is received,
    if new anchor or lower hop count
    then
        store hop count for this anchor.
        broadcast new packet for this anchor with
            hop count = (hop count + 1).
    else
        do nothing.
    if average hop distance is known and
       number of anchors >= (dimension of space + 1)
    then
        triangulate.
    else
        do nothing.


The resulting position estimate is likely to be coarse in 
terms of accuracy, but it provides an initial condition 
from which Refinement can launch. The performance 
of this algorithm is discussed in detail in Section 5. 
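

The hop-count flooding and the stale-information rule can be illustrated with a small simulation. This is a minimal sketch, not the authors' C++/OMNeT++ implementation: the function name `flood_hop_counts`, the adjacency-dictionary representation, and the use of a FIFO queue to stand in for sequential delivery of broadcasts are all assumptions made for the example. Anchor positions and the triangulation step are omitted, and each packet here carries the sender's own hop count, which the receiver increments.

```python
from collections import deque


def flood_hop_counts(neighbors, anchors):
    """Simulate hop-count flooding (illustrative sketch).

    neighbors: dict mapping node ID -> list of one-hop neighbor IDs.
    anchors:   iterable of node IDs that act as anchors.
    Returns a dict mapping node ID -> {anchor ID: smallest hop count seen}.
    """
    table = {node: {} for node in neighbors}
    pending = deque()  # broadcasts waiting to be delivered: (sender, anchor, hops)

    # Each anchor launches the algorithm with a hop count of 0 for itself.
    for anchor in anchors:
        table[anchor][anchor] = 0
        pending.append((anchor, anchor, 0))

    while pending:
        sender, anchor, hops = pending.popleft()
        for node in neighbors[sender]:
            new_hops = hops + 1
            known = table[node].get(anchor)
            # Stale: anchor already known and the arriving count is not lower.
            if known is not None and new_hops >= known:
                continue
            # New information: store it and rebroadcast.
            table[node][anchor] = new_hops
            pending.append((node, anchor, new_hops))

    return table


if __name__ == "__main__":
    # Four nodes in a line, with anchors at both ends.
    line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    print(flood_hop_counts(line, anchors=[0, 3]))
```

Because stale packets are never rebroadcast, the queue drains and the loop terminates even though every broadcast also reaches nodes "behind" the sender.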


3.2 Refinement 


Given the initial position estimates of Hop-TERRAIN 
in the start-up phase, the objective of the refinement 
phase is to obtain more accurate positions using the 
estimated ranges between nodes. Since Refinement must 
operate in an ad-hoc network, only the distances to the 
direct (one-hop) neighbors of a node are considered. 
This limitation allows Refinement to scale to arbitrary 
network sizes and to operate on low-level networks that 
do not support multi-hop routing (only a local broadcast 
is required).


Refinement is an iterative algorithm in which the nodes 
update their positions in a number of steps. At the 
beginning of each step a node broadcasts its position 
estimate, receives the positions and corresponding range 
estimates from its neighbors, and computes a least 
squares triangulation solution to determine its new po- 
sition. In many cases the constraints imposed by the 
distances to the neighboring locations will force the new 
position towards the true position of the node. When, 
after a number of iterations, the position update becomes
small, Refinement stops and reports the final position.
Note that Refinement is by nature an ad-hoc (distributed)
algorithm. 


The beauty of Refinement is its simplicity, but that also 
limits its applicability. In particular, it was a priori not
clear under what conditions Refinement would converge 
and how accurate the final solution would be. A number 
of factors that influence the convergence and accuracy of 
iterative Refinement are: 


• the accuracy of the initial position estimates,
• the magnitude of errors in the range estimates,
• the average number of neighbors, and
• the fraction of anchor nodes.


Based on previous experience we assume that redun- 
dancy can counter the above influences to a large ex- 
tent. When a node has more than 3(4) neighbors in 
a 2(3)-dimensional space the induced system of linear 
equations is over-defined and errors will be averaged out 
by the least squares solver. For example, data collected 
by Beutel [1] shows that large range errors (standard 
deviation of 50%) can be tolerated when locating a node 
surrounded by 5 (or more) anchors in a 2-dimensional 
space: the average distance between the estimated and
true position of the node is less than 5% of the radio
range. 


Despite the positive effects from redundancy we ob- 
served that a straightforward application of Refinement 
did not converge in a considerable number of “rea- 
sonable” cases. Close inspection of the sequence of 
steps taken under Refinement revealed two important 
causes: 


1. Errors propagate fast throughout the whole network. 
If the network has a diameter d, then an error intro- 
duced by a node in step s has (indirectly) affected 
every node in the network by step s + d because of 
the triangulate-hop-triangulate-hop... pattern.


2. Some network topologies are inherently hard, or 
even impossible, to locate. For example, a cluster 
of n nodes (no anchors) connected by a single link 
to the main network can be simply rotated around 
the ‘entry’-point into the network while keeping the 
exact same intra-node ranges. Another example is
given in Figure 1.


To mitigate error propagation we modified the refine- 
ment algorithm to include a confidence associated with 
each node’s position. The confidences are used to 
weigh the equations when solving the system of linear 
equations. Instead of solving Ax=b we now solve 
wAx=wb, where w is the vector of confidence weights.
Nodes, like anchors, that have high faith in their position 
estimates select high confidence values (close to 1). A 
node that observes poor conditions (e.g., few neighbors, 
poor constellation) associates a low confidence (close to 
0) with its position estimate, and consequently has less
impact on the outcome of the triangulations performed 
by its neighbors. The details of confidence selection will 
be discussed in Section 4.3. The usage of confidence 
weights improved the behavior of Refinement greatly: 
almost all cases converge now, and the accuracy of the 
positions is also improved considerably. 
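

The effect of the confidence weights can be seen in a small solver for the 2-dimensional case. This is a sketch under assumptions, not the paper's implementation: it interprets wAx=wb as scaling row i of the system by w_i, so that the solver minimizes the sum of (w_i * residual_i)^2, and it solves the resulting 2x2 normal equations directly. The function name and the row-list representation of A are hypothetical.

```python
def solve_weighted_least_squares_2d(A, b, w):
    """Solve the row-weighted system wAx = wb for two unknowns (x, y).

    A: list of (a0, a1) rows; b: list of right-hand sides; w: list of weights.
    Returns (x, y), or None if the weighted system is underdetermined.
    """
    s00 = s01 = s11 = t0 = t1 = 0.0
    for (a0, a1), bi, wi in zip(A, b, w):
        ww = wi * wi  # scaling a row by w_i scales its squared residual by w_i^2
        s00 += ww * a0 * a0
        s01 += ww * a0 * a1
        s11 += ww * a1 * a1
        t0 += ww * a0 * bi
        t1 += ww * a1 * bi

    # Normal equations: [[s00, s01], [s01, s11]] (x, y)^T = (t0, t1)^T
    det = s00 * s11 - s01 * s01
    if abs(det) < 1e-12:
        return None  # too few independent equations (e.g. bad constellation)
    x = (s11 * t0 - s01 * t1) / det
    y = (s00 * t1 - s01 * t0) / det
    return x, y


if __name__ == "__main__":
    A = [(1.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
    b = [1.0, 2.0, 3.0]
    # Equal confidence: the two conflicting equations for x are averaged.
    print(solve_weighted_least_squares_2d(A, b, [1.0, 1.0, 1.0]))
    # Zero confidence in the third equation: it no longer influences x.
    print(solve_weighted_least_squares_2d(A, b, [1.0, 1.0, 0.0]))
```

In the usage example the first and third equations disagree about x; lowering the weight of one of them moves the solution towards the other, which is how a low-confidence neighbor ends up with less impact on the result.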


Another improvement to Refinement was necessary to 
handle the second issue of ill-connected groups of nodes.
Detecting that a single node is ill-connected is easy: if 
the number of neighbors is less than 3(4) then the node 
is ill-connected in a 2(3)-dimensional space. Detecting 
that a group of nodes is ill-connected, however, is more
complicated since some global overview is necessary. 
We employ a heuristic that operates in an ad-hoc fashion 
(no centralized computation), yet is able to detect most 
ill-connected nodes. The underlying premise for the 
heuristic is that a sound node has independent references 
to at least 3(4) anchors. That is, the multi-hop routes 
to the anchors have no link (edge) in common. For 
example, node 3 in Figure 1 (which is taken from [12])
meets this criterion and is considered sound.


Figure 1: Example topology. (Legend: anchor, unknown.)


To determine if a node is sound, the Hop-TERRAIN al- 
gorithm records the ID of each node’s immediate neigh- 
bor along a shortest path to each anchor. When multiple 
shortest paths are available, the first one discovered is 
used (this only approximates the intended condition but 
is considerably simpler). These IDs are collected in a 
set of sound neighbors. When the number of unique 
IDs in this set reaches 3(4), a node declares itself sound 
and may enter the Refinement phase. The neighbors 
of the sound node add its ID to their sets and may in 
turn become sound if their sound sets become sufficient. 
This process continues throughout the network. The end 
result is that most ill-connected nodes will not be able to 
fill their sets of sound neighbors with enough entries and,
therefore, may not participate in the Refinement phase.
In the example topology in Figure 1, node 3 will become 
sound, but node 4 will not. We also note that the more 
restrictive participating node definition by Savvides et 
al. renders both unknown nodes as ill-conditioned [12]. 
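

The soundness heuristic can be sketched as a fixed-point computation over the sets of sound neighbors. The code below is an illustration only: it runs centrally over a whole graph for the sake of a compact example, whereas the heuristic in the text operates in a distributed fashion, and the names `find_sound_nodes`, `neighbors`, and `first_hops` are hypothetical. The input `first_hops[node][anchor]` is assumed to hold the ID of the node's immediate neighbor on the first shortest path discovered to that anchor.

```python
def find_sound_nodes(neighbors, first_hops, dimension=2):
    """Return the set of nodes that declare themselves sound.

    neighbors:  dict node ID -> list of one-hop neighbor IDs.
    first_hops: dict node ID -> {anchor ID: neighbor ID on a shortest path}.
    dimension:  2 or 3; a node needs (dimension + 1) unique IDs to be sound.
    """
    needed = dimension + 1
    # Start each set with the first-hop neighbors recorded during flooding.
    sound_sets = {n: set(first_hops.get(n, {}).values()) for n in neighbors}
    sound = set()

    changed = True
    while changed:
        changed = False
        for node in neighbors:
            if node not in sound and len(sound_sets[node]) >= needed:
                sound.add(node)
                changed = True
                # Neighbors of a sound node add its ID to their own sets.
                for nb in neighbors[node]:
                    sound_sets[nb].add(node)
    return sound


if __name__ == "__main__":
    # 'u' touches three anchors directly; 'v' hangs off 'u' by a single link.
    graph = {"A": ["u"], "B": ["u"], "C": ["u"],
             "u": ["A", "B", "C", "v"], "v": ["u"]}
    hops = {"u": {"A": "A", "B": "B", "C": "C"},
            "v": {"A": "u", "B": "u", "C": "u"}}
    print(find_sound_nodes(graph, hops))
```

In the usage example the node attached by a single link never collects more than one unique ID, so it stays out of the Refinement phase, matching the single-link cluster case described in Section 3.2.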


Refinement with both modifications (confidence 
weights, detection of ill-connected nodes) performs 
quite satisfactorily, as will be shown by the experiments 
in Section 5. 


4 Simulation and algorithm details 


To study the robustness of our two-phase positioning 
algorithm we created a simulation environment in which
we can easily control a number of (network) parameters. 
We implemented the Hop-TERRAIN and Refinement 
algorithms as C++ code running under the control of 
the OMNeT++ discrete event simulator [13]. The al- 
gorithms are event driven, where an event can be an 
incoming message or a periodic timer. Processing an 
event usually involves updating internal state, and often 
generates output messages that must be broadcast. All 
simulated sensor nodes run exactly the same C++ code. 
The OMNeT++ library is in control of the simulated 
time and enforces a semi-concurrent execution of the 
code ‘running’ on the multiple sensor nodes. 


4.1 Network layer 


Although our positioning algorithm is designed to be 
used in an ad-hoc network that presumably employs 
multi-hop routing algorithms, our algorithm only re- 
quires that a node be able to broadcast a message to all 
of its one hop neighbors. An important result of this 
is the ability for system designers to allow the routing 
protocols to rely on position information, rather than the 
positioning algorithm relying on routing capabilities. 


An important issue is whether or not the network pro- 
vides reliable communication in the presence of concur- 
rent transmission. In this paper we assume that message 
loss or corruption does not occur and that each message 
is delivered at the neighbors within a fixed radio range 
(R) from the sending node. Concurrent transmissions
are allowed when the transmission areas (circles) do not 
overlap. A node wanting to broadcast a message while 
another message in its area is in progress must wait until
that transmission (and possibly other queued messages) 
completes. In effect we employ a CSMA policy. 


The functionality of the network layer (local broadcast)
is implemented in a single OMNeT++ object, which is 
connected to all sensor-node objects in the simulation. 
This network object holds the topology of the simulated 
sensor network, which can be read from a “scenario” file
or generated at random at initialization time. At time 
zero the network object sends a pseudo message to each 
sensor-node object telling its role (anchor or unknown) 
and some attributes (e.g., the position in the case of an 
anchor node). From then on it relays messages generated 
by sensor nodes to the sender’s neighbors within a radius 
of R units.


4.2 Hop-TERRAIN 


At time zero of the Hop-TERRAIN algorithm, all of 
the nodes in the network are waiting to receive hop 
count packets informing them of the positions and hop 
distances associated with each of the anchor nodes. Also 
at time zero, each of the anchor nodes in the network 
broadcasts a hop count packet, which is received and 
repeated by all of the anchors’ one-hop neighbors. This 
information is propagated throughout the network until, 
ideally, all the nodes in the network have positions and 
hop counts for all of the anchors in the network as well 
as an average hop distance (see below). At this point,
each of the nodes performs a triangulation to create an 
initial estimate of its position. The number of anchors in 
any particular scenario is not known by the nodes in the 
network, however, so it is difficult to define a stopping 
criterion to dictate when a node should stop waiting for
more information before performing a triangulation. To 
solve this problem, nodes perform triangulations every 
time they receive information that is not stale after hav- 
ing received information from the first 3(4) anchors in a 
2(3)-dimensional space (see Section 3.1 for a definition 
of stale information). 


Nodes also rely on the anchor nodes to inform them 
of the value to use for the assumed average hop dis- 
tance used in calculating the estimated range to each 
anchor. Initially we experimented with simply using 
the maximum radio range for this quantity. Better 
position results, however, are attained by dynamically 
determining the average hop distance by comparing 
the number of hops between the anchors themselves 
to the known distances separating them following the 
calibration procedure used for DV-hop (see Section 2). 
We implemented the calibration procedure as a separate 
pass that follows the initial hop-count flooding. When an 
anchor node receives a hop count from another anchor 
it computes its estimate of the average hop distance, 
and floods that back into the network. Nodes wait for 
the first such estimate to arrive before performing any
triangulation as outlined above. Subsequent estimates 
from other anchor pairs are simply discarded to reduce 
network load. 
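

The calibration step amounts to a single division per anchor pair. The sketch below shows the arithmetic only; the function names are hypothetical, and the flooding of the result back into the network is not modeled.

```python
import math


def average_hop_distance(own_position, other_position, hop_count):
    """Estimate the average hop distance from one anchor pair.

    own_position, other_position: known coordinates of the two anchors.
    hop_count: number of hops between them, as learned from flooding.
    """
    if hop_count <= 0:
        raise ValueError("hop_count must be positive")
    return math.dist(own_position, other_position) / hop_count


def estimated_range(hop_count, avg_hop_distance):
    """Convert a hop count to an anchor into an estimated range."""
    return hop_count * avg_hop_distance


if __name__ == "__main__":
    # Two anchors 5 units apart and 2 hops from each other.
    avg = average_hop_distance((0.0, 0.0), (3.0, 4.0), 2)
    # An unknown node 3 hops from an anchor then assumes this range to it.
    print(avg, estimated_range(3, avg))
```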


The above details are sufficient for controlling the Hop- 
TERRAIN algorithm within a simulated environment 
where all of the nodes start up at the same time. One 
important consequence of a real network, however, is 
that the nodes in the network start up or enter the 
network at random times, relative to each other. This 
allows for the possibility that a late node might miss 
some of the waves of propagated broadcast messages 
originating at the anchor nodes. To solve this, each
node is programmed to announce itself when it first 
comes online in a new network. Likewise, every node 
is programmed to respond to these announcements by 
passing the new node their own position estimates, the 
positions of all of the anchor nodes they know of, and 
the hop counts and hop distance metrics associated with 
these anchors. Note that, according to the rebroadcast 
rules regarding stale information, this information will 
all be new to the new node, causing this new node to then
rebroadcast all of the information to all of its one-hop 
neighbors. This becomes important in the cases where 
the new node forms a link between two clusters of nodes
that were previously not connected. In cases where all or 
most of the new node’s one-hop neighbors came online 
before the new node, this information will most likely 
be considered stale, and so these broadcasts will not be 
repeated past a distance of one hop. 


4.3 Refinement 


The refinement algorithm is implemented as a periodic 
process. The information in incoming messages is 
recorded internally, but not processed immediately. This 
allows for accumulating multiple position updates from 
different neighbors, and responding with a single reply 
(outgoing broadcast message). The task of an anchor 
node is very simple: it broadcasts its position whenever 
it has detected a new neighbor in the preceding period. 
The task of an unknown node is more complicated. 
If new information arrived in the preceding period it 
performs a triangulation to compute a new position 
estimate, determines an associated confidence level, and 
finally decides whether or not to send out a position 
update to its neighbors. 


A confidence is a value between 0 and 1. Anchors imme- 
diately start off with confidence 1; unknown nodes start 
off at a low value (0.1) and may raise their confidence 
at subsequent Refinement iterations. Whenever a node 
performs a successful triangulation it sets its confidence
to the average of its neighbors’ confidences. This will, 
in general, raise the confidence level. Nodes close to an- 
chors will raise their confidence at the first triangulation, 
raising in turn the confidence of nodes two hops away 
from anchors on the next iteration, etc. Triangulations 
sometimes fail or the new position is rejected on other 
grounds (see below). In these cases the confidence is
set to zero, so neighbors will not be using erroneous 
information of the inconsistent node in the next iteration. 
This generally leads to new neighbor positions bringing 
the faulty node back into a consistent state, allowing 
it to build its confidence level again. In unfortunate 
cases a node keeps getting back into an inconsistent 
state, never converging to a final position/confidence. 
To guarantee termination we simply limit the number of
position updates of a node to a maximum. Nodes that 
end up with a poor confidence (< 0.1) are discarded and 
excluded from the reported error results; all others are 
considered to be located and included in the results. 
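

The confidence rules above are simple enough to state as a few lines of code. This is a sketch of the rules as described in the text; the constant and function names are hypothetical, and the handling of a node with no neighbors is an assumption of the example.

```python
ANCHOR_CONFIDENCE = 1.0           # anchors start with full confidence
INITIAL_UNKNOWN_CONFIDENCE = 0.1  # unknown nodes start low
DISCARD_THRESHOLD = 0.1           # nodes ending below this are discarded


def updated_confidence(neighbor_confidences, position_accepted):
    """Confidence of an unknown node after one Refinement step.

    neighbor_confidences: confidences reported by the node's neighbors.
    position_accepted: False if the triangulation failed or was rejected.
    """
    if not position_accepted or not neighbor_confidences:
        # Assumption: a node without neighbors is treated like a failure.
        return 0.0
    return sum(neighbor_confidences) / len(neighbor_confidences)


def is_located(confidence):
    """Whether a node's final position is included in the reported results."""
    return confidence >= DISCARD_THRESHOLD


if __name__ == "__main__":
    # One anchor and two fresh unknowns as neighbors.
    c = updated_confidence(
        [ANCHOR_CONFIDENCE, INITIAL_UNKNOWN_CONFIDENCE, INITIAL_UNKNOWN_CONFIDENCE],
        position_accepted=True,
    )
    print(c, is_located(c))
```

A node next to an anchor thus rises above its initial confidence on its first successful triangulation, while a rejected position drops the node to zero so that neighbors ignore it in the next iteration.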


To avoid flooding the network with insignificant or 
erroneous position updates the triangulation results are 
classified as follows. First, a triangulation may simply 
fail because the system of equations is underdetermined
(too few neighbors, bad constellation). Second, the new 
position may be very close to the current one, rendering 
the position update insignificant. We use a tight cut- 
off radius of Ga of the radio range; experimentation 
showed Refinement is fairly insensitive to this value 
as long as it is small (under 1% of the radio range). 
Third, we check that the new position is within the reach 
of the anchors used by Hop-TERRAIN. Similarly to 
Doherty et al. [5] we check the convex constraints that
the distance between the position estimate and anchor
a_i must be less than the length of the shortest path
to a_i (hop-count_i) times the radio range (R). When
the position drifts outside the convex region, we reset 
the position to the original initial position computed by 
Hop-TERRAIN. Finally, the validity of the new position 
is checked by computing the difference between the sum 
of the observed ranges and the sum of the distances 
between the new position and the neighbor locations. 
Dividing this difference by the number of neighbors 
yields a normalized residue. If the residue is large
(residue > radio range) we assume that the system of 
equations is inconsistent and reject the new position. To 
avoid being trapped in some local minima, however, we 
occasionally accept bad moves (10% chance), similar 
to a simulated annealing procedure (without cooling 
down), and reduce the confidence by 50%. 
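
The classification above can be summarized in a short sketch. It is an illustration under stated assumptions, not the authors' implementation: the function name, the return labels, the 2D tuples, and the use of the absolute value of the summed difference are all choices made here, and the cut-off radius is left as a parameter.

```python
import math
import random


def classify_update(new_pos, cur_pos, neighbors, ranges, anchors, hops,
                    radio_range, cutoff, rng=random.random):
    """Classify one triangulation result.

    new_pos is None when the triangulation itself failed. neighbors and
    anchors are lists of (x, y) tuples, ranges holds the measured range
    to each neighbor, and hops holds the hop count to each anchor.
    """
    if new_pos is None:
        return "failed"            # underdetermined system of equations
    if math.dist(new_pos, cur_pos) < cutoff:
        return "insignificant"     # not worth broadcasting
    for anchor, hop_count in zip(anchors, hops):
        # Convex constraint: stay within hop_count * R of every anchor.
        if math.dist(new_pos, anchor) >= hop_count * radio_range:
            return "out_of_bounds"  # caller resets to the Hop-TERRAIN position
    observed = sum(ranges)
    implied = sum(math.dist(new_pos, n) for n in neighbors)
    residue = abs(observed - implied) / len(neighbors)
    if residue > radio_range:
        # Occasionally accept a bad move (10% chance); the caller then
        # halves the node's confidence.
        return "bad_move" if rng() < 0.10 else "rejected"
    return "accepted"
```

Passing rng as a parameter keeps the 10% acceptance rule testable; a simulator would simply use the default.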


Figure 2: Average position error after Hop-TERRAIN
(5% range errors).


An unexpected source of errors is that Hop-TERRAIN
assigns the same initial position to all nodes with iden-
tical hop counts to the anchors. For example, twin
nodes that share the exact same set of neighbors are both
assigned the same initial position. The consequence is
that a neighbor of two ‘look-alikes’ is confronted with 
a large inconsistency: two nodes that share the same 
position have two different range estimates. Simply 
dropping one of the two equations from the triangulation 
yields better position estimates in the first iteration of 
Refinement and even has a noticeable impact on the 
accuracy of the final position estimates. 


5 Experiments


In order to evaluate our algorithm, we ran many exper- 
iments on both Hop-TERRAIN and Refinement using 
the OMNeT++ simulation environment. All data points 
represent averages over 100 trials in networks containing
400 nodes. The nodes are randomly placed, with a 
uniform distribution, within a square area. The spec- 
ified fraction of anchors is randomly selected, and the 
range between connected nodes is blurred by drawing 
a random value from a normal distribution having a 
parameterized standard deviation and having the true 
range as the mean¹. The connectivity (average number
of neighbors) is controlled by specifying the radio range.
To allow for easy comparison between different scenar- 
ios, range errors as well as errors on position estimates 
are normalized to the radio range (i.e. 50% position error 
means half the range of the radio). 
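
As a rough sketch of the error model just described (not the OMNeT++ code itself), the range blurring and the normalization might look as follows. The function names are hypothetical, and interpreting an "x% range error" as a standard deviation of x% of the radio range is an assumption based on the normalization stated above.

```python
import math
import random


def blurred_range(true_range, radio_range, rel_sigma, rng=random):
    """Draw a measured range from a normal distribution.

    The mean is the true range and the standard deviation is rel_sigma
    times the radio range (0.05 for "5% range errors"). Negative draws
    are clipped to zero.
    """
    return max(0.0, rng.gauss(true_range, rel_sigma * radio_range))


def normalized_error(estimate, truth, radio_range):
    """Position error as a fraction of the radio range (0.5 means 50%)."""
    return math.dist(estimate, truth) / radio_range
```

With a radio range of 10, an estimate that lies 5 units from the true position has a normalized error of 0.5, that is, a 50% position error.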


Figure 2 shows the average performance of the Hop-
TERRAIN algorithm as a function of connectivity and
anchor population in the presence of 5% range errors. As
seen in this plot, position estimates by Hop-TERRAIN
have an average accuracy under 100% error in scenarios
with at least 5% anchor population and an average
connectivity level of 7 or greater. In extreme situations
where very few anchors exist and connectivity in the
network is very low, Hop-TERRAIN errors reach above
250%.


¹Ranges are enforced to be non-negative by clipping values below
zero.


Figure 3: Average position error after Refinement (5%
range errors).


Figure 4: Fraction of located nodes (2% anchors, 5%
range errors).


Figure 3 displays the results from the same experiment 
depicted in Figure 2, but now the position estimates 
of Hop-TERRAIN are subsequently processed by the 
Refinement algorithm. Its shape is similar to that of 
Figure 2, showing relatively consistent error levels of 
less than 33% in scenarios with at least 5% anchor pop- 
ulation and an average connectivity level of 7 or greater. 
Refinement also has problems with low connectivity and 
anchor populations, and is shown to climb above 50% 
position error in these harsh conditions. Overall Refine- 
ment improves the accuracy of the position estimates by 
Hop-TERRAIN by a factor three to five. 


Figure 4 helps to explain the sharp increases in posi-
tioning errors for low anchor populations and sparse
networks shown in Figures 2 and 3. Figure 4 shows that,
as the average connectivity between nodes throughout
the network decreases past certain points, both algo-
rithms break down, failing to derive position estimates
for large fractions of the network. This is due simply to
a lack of sufficient information, and is a necessary
consequence of loosely connected networks. Nodes
can only be located when connected to at least 3(4)
neighbors; Refinement also requires a minimal confi-
dence level (0.1). It should be noted that the results
in Figure 4 imply that the reported average position
errors for low connectivities in Figures 2 and 3 have
low statistical significance, as these points represent only
small fractions of the total network. Nevertheless, the
general conclusion to be drawn from Figures 2, 3, and
4 is that both Hop-TERRAIN and Refinement perform
poorly in networks with average connectivity levels of
less than 7.


Figure 5: Average position error after Hop-TERRAIN
(2D grid, 5% range errors).


Figure 6: Range error sensitivity (10% anchors, connec-
tivity 12).


Figure 7: Cumulative error distribution (5% range er-
rors).


Since connectivity has a pronounced effect on position
error we were interested in whether other topological characteris-
tics would show large effects as well. In the following
experiment we randomly place 400 nodes on the vertices
of a 200x200 grid, rather than allowing the nodes to 
sit anywhere in the square area. We found that the 
grid layout did not result in better performance for the 
Refinement algorithm, relative to the performance of 
the Refinement algorithm with random node placement. 
We do not include a plot here because it looks almost 
identical to Figure 3. We did find a difference in per- 
formance for Hop-TERRAIN though. Figure 5 shows 
that placing the nodes on a grid dramatically reduces 
the errors of the Hop-TERRAIN algorithm in the cases 
where connectivity or anchor node populations are low. 
For example, with 5% anchors and a connectivity of 8 
nodes, the average position error decreases from 95% 
(random distribution) to 60% (grid). We suspect this 
is due to the consistent distances between nodes, the 
ideal topologies within clusters that result from the grid
layout, and the inherently optimized connectivity levels 
across the entire network. 


Sensitivity to average error levels in the range measure-
ments is a major concern for positioning algorithms.
Figure 6 shows the results of an experiment in which we 
held anchor population and connectivity constant at 10% 
and 12 nodes, respectively, while varying the average 
level of error in the range measurements. We found 
that Hop-TERRAIN was almost completely insensitive 
to range errors. This is a result of the binary nature 
of the procedure in which routing hops are counted; 
if nodes can see each other, they pass on incremented 
hop counts, but at no time do any nodes attempt to 
measure the actual ranges between them. Unlike Hop- 
TERRAIN, Refinement does rely on the range measure- 
ments performed between nodes, and Figure 6 shows 
this dependence accordingly. At less than 40% error in 
the range measurements, on average, Refinement offers 
improved position estimates over Hop-TERRAIN. The 
results improve steadily as the range errors decrease.


Figure 8: Relation between confidence and positioning
error (average and standard deviation).
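
The insensitivity of Hop-TERRAIN to range errors follows from the fact that hop counting only asks whether two nodes can hear each other. The sketch below is a centralized stand-in for the distributed flood, written for illustration only: the function name and the breadth-first formulation are assumptions. Note that no measured range appears anywhere in it, only connectivity.

```python
import math
from collections import deque


def hop_counts(positions, anchor, radio_range):
    """Breadth-first hop counts from one anchor.

    positions is a list of (x, y) tuples and anchor is an index into it.
    Two nodes are connected when they are within radio range of each
    other. Unreachable nodes are absent from the result.
    """
    hops = {anchor: 0}
    queue = deque([anchor])
    while queue:
        u = queue.popleft()
        for v, pos in enumerate(positions):
            if v not in hops and math.dist(positions[u], pos) <= radio_range:
                hops[v] = hops[u] + 1   # pass on an incremented hop count
                queue.append(v)
    return hops
```

Blurring the range measurements changes nothing in this computation, which matches the flat Hop-TERRAIN curve reported for Figure 6.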


For reference we determined the best possible position 
information that can be obtained in each case. For each 
node we performed a triangulation using the true posi- 
tions of its neighbors and the corresponding erroneous 
range measurements. The resulting position errors are 
plotted as the lower bound in Figure 6. This suggests 
that there is room for improvement for Refinement. 
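
A lower bound of this kind can be reproduced with any least-squares triangulation. The sketch below uses one common linearization (subtracting the last range equation from the others and solving the 2x2 normal equations) in two dimensions. It is an assumption that this matches the triangulation used in the experiments, and the function name is hypothetical.

```python
import math


def lateration(neighbors, ranges):
    """Least-squares 2D position from neighbor positions and ranges.

    Each neighbor i gives (x - xi)^2 + (y - yi)^2 = di^2. Subtracting
    the last equation from the others yields linear equations
    a*x + b*y = c, which are solved in the least-squares sense.
    Returns None when the system is underdetermined.
    """
    (xn, yn), dn = neighbors[-1], ranges[-1]
    s_aa = s_ab = s_bb = s_ac = s_bc = 0.0
    for (xi, yi), di in zip(neighbors[:-1], ranges[:-1]):
        a = 2.0 * (xi - xn)
        b = 2.0 * (yi - yn)
        c = xi**2 - xn**2 + yi**2 - yn**2 + dn**2 - di**2
        s_aa += a * a
        s_ab += a * b
        s_bb += b * b
        s_ac += a * c
        s_bc += b * c
    det = s_aa * s_bb - s_ab * s_ab
    if abs(det) < 1e-12:
        return None   # too few neighbors or a collinear constellation
    x = (s_ac * s_bb - s_ab * s_bc) / det
    y = (s_aa * s_bc - s_ab * s_ac) / det
    return (x, y)
```

Feeding the function the true neighbor positions together with the blurred ranges, and comparing the result with the node's true position, gives a per-node bound of the kind plotted in Figure 6.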


Up until this point we reported average position errors. 
Figure 7, in contrast, gives a detailed look at the distri- 
bution of the position errors for individual nodes under 
four different scenarios. Note that the distributions have 
similar shapes: many nodes with small errors, large tails 
with outliers. Refinement’s confidence metrics are to 
some extent capable of pinpointing the outliers. Figure 8 
shows the relationship between position error levels and 
the corresponding confidence values assigned to each 
node. The data for Figure 8 was taken from the best 
and worst case scenarios from the same experiment used 
to generate Figure 7. As desired, the nodes with higher 
position errors are assigned lower confidence levels. 
In the easier case, the confidence indicators are much 
more reliable than in the more difficult case. The large 
standard deviations, however, show that confidence is
not a good indicator for position accuracy. This is
unfortunate since a reliable confidence metric would be
very useful for applications, for example, to identify
regions of “bad” nodes. Currently, the value of using
confidence levels is the improved average positioning
error compared to a naive implementation of Refinement
without confidences.


Figure 9: Geographic error distribution (5% anchors,
connectivity 12, 5% range errors).


Finally, yet another useful way of looking at the distri- 
bution of errors over individual nodes is to take their 
geographical location into account. Figure 9 plots 
positioning errors as a function of a node’s location
in the square testing area. This experiment used 400 
randomly placed nodes, an anchor population of 5%, 
an average connectivity level of 12, and range errors of 
5%. The error distribution in Figure 9 is quite typical 
for many scenarios showing that areas along the edges 
of the network lacking a high concentration of anchor 
nodes are particularly susceptible to high position errors. 


6 Discussion 


It is interesting to compare our results from the previ- 
ous section with the alternative approaches discussed 
in Section 2. First, we discuss the performance of 
Hop-TERRAIN and related algorithms that do not use 
range measurements. Hop-TERRAIN is similar to the
“DV-hop” algorithm by Niculescu and Nath [10], but we
get consistently higher position errors, for example, 69% 
(Hop-TERRAIN) versus 35% (DV-hop) on a scenario 
with 10% anchors and a connectivity of 8. Under 
poorer network conditions though, Hop-TERRAIN is 
more robust than DV-hop, showing about a factor of 
2 improvement in position accuracy in sparsely con- 
nected networks. Regardless, the trend observed in 
both studies is the same: when the fraction of anchors
drops below 5%, position errors rapidly increase. The
convex optimization technique by Doherty et al. [5] is
about as accurate as Hop-TERRAIN, except for very
low fractions of anchors. For example, convex opti-
mization achieves position errors that are above 150%
on a scenario (200 nodes, 5% anchors, connectivity of 6)
where Hop-TERRAIN errors are around 125%; the gap
grows for even lower fractions of anchors. As mentioned
earlier, convex optimization is a centralized algorithm.


The results of Refinement are comparable to those re- 
ported by Savvides et al. for an “iterative multilatera- 
tion” scenario with 50 nodes, 20% anchors, connectivity 
10, and 1% range errors [12]. Their algorithm, however,
can handle neither low anchor fractions nor low con- 
nectivities, because positioning starts from nodes con- 
nected to at least 3 anchors. Refinement still performs 
acceptably well with few anchors or a low connectivity. 
Furthermore the preliminary results of their more ad-
vanced “collaborative multilateration” algorithm show
that Refinement is able to determine the position of a 
larger fraction of unknowns: 56% (Refinement) versus 
10% (collaborative multilateration) on a scenario with 
just 5% anchors (200 nodes, connectivity 6). 


The “Euclidean” algorithm by Niculescu and Nath uses 
range estimates to construct local maps that are unified 
into a single global map [10]. The results reported for 
random configurations show that “Euclidean” is rather
sensitive to range errors, especially with low fractions of 
anchors: in case of 10% anchors their Hop-TERRAIN 
equivalent (DV-hop) outperforms Euclidean. Refine- 
ment achieves better position estimates and is more 
robust since the cross over with Hop-TERRAIN occurs 
around 40% range errors (see Figure 6). 


In summary, the performance of Hop-TERRAIN and 
Refinement is comparable to other algorithms in the 
case of “easy” network topologies (high connectivity, 
many anchors) with low range errors, and outperforms 
the competition in difficult cases (low connectivity, few 
anchors, large range errors). The results of Refinement
can most likely be improved even further when the
placement of anchor nodes can be controlled, given the
positive experience reported by others [2, 5]. Since 
the largest errors occur along the edges of the network 
(see Figure 9), most anchors should be placed on the 
perimeter of the network. Another approach to increase 
the accuracy of locationing systems is to use other 
sources of information. When locating sensors in a 
room, for example, knowing that the sensors are wall 
mounted eliminates one degree of freedom. Incorporat- 
ing such knowledge in localization algorithms, however, 
requires great care. For example, knowing that two 
sensors cannot communicate does not imply that they
are located far apart since a wall may simply prohibit
radio communication. 


Based on the experimental results from Section 5 and the 
discussion above we recommend a number of guidelines 
for the installation of wireless sensor networks: 


• place anchors carefully (i.e. at the edges), and either
• ensure a high connectivity (> 10), or
• employ a reasonable fraction of anchors (> 5%).


This will create the best conditions for positioning algo- 
rithms in general, and for Hop-TERRAIN and Refine- 
ment in particular. 


7 Conclusions and future work 


In this paper we have presented a completely distributed 
algorithm for solving the problem of positioning nodes 
within an ad-hoc, wireless network of sensor nodes. 
The procedure is partitioned into two algorithms: Hop-
TERRAIN and Refinement. Each algorithm is described
in detail. The simulation environment used to evaluate 
these algorithms is explained, including details about 
the specific implementation of each algorithm. Many 
experiments are documented for each algorithm, show- 
ing several aspects of the performance achieved under 
many different scenarios. The results show that we are 
able to achieve position errors of less than 33% in a 
scenario with 5% range measurement error, 5% anchor 
population, and an average connectivity of 7 nodes. 
Finally, guidelines for implementing and deploying a
network that will use these algorithms are given and 
explained. 


An important aspect of wireless sensor networks is
energy consumption. In the near future we therefore plan
to study the amount of communication and computation
induced by running Hop-TERRAIN and Refinement. A
particularly interesting aspect is how the accuracy vs.
energy consumption trade-off changes over subsequent
iterations of Refinement.


Abstract


The typical duration of multimedia streams makes wire- 
less network interface (WNIC) energy consumption a 
particularly acute problem for mobile clients. In this 
work, we explore ways to transmit data packets in a pre-
dictable fashion, allowing the clients to transition the
WNIC to a lower power consuming sleep state. First, we 
show the limitations of IEEE 802.11 power saving mode 
for isochronous multimedia streams. Without an under-
standing of the stream requirements, it does not offer
any energy savings for multimedia streams over 56 kbps.
The potential energy savings are also affected by multiple
clients sharing the same access point. On the other hand, 
an application-specific server side traffic shaping mech- 
anism can offer good energy saving for all the stream 
formats without any data loss. We show that the mech- 
anism can save up to 83% of the energy required for re- 
ceiving data. The technique offers similar savings for 
multiple clients sharing the same wireless access point. 
For high fidelity streams, media players react to these 
added delays by lowering the stream fidelity. We pro- 
pose that future media players should offer configurable 
settings for recognizing such energy-aware packet delay
mechanisms. 


1 Introduction 


The proliferation of inexpensive, multimedia capable 
mobile devices and ubiquitous high-speed network tech- 
nologies to deliver multimedia objects is fueling the de- 
mand for mobile streaming multimedia. Public venues 
[31] are deploying high speed IEEE 802.11b [24] based
public wireless LAN networks. Commodity PDA de- 
vices that allow the users to consume mobile streaming 
multimedia are becoming popular. A necessary feature 
for mass acceptance of a streaming multimedia device
is acceptable battery life. Advances in hardware and
software technologies have not been matched by corre-
sponding improvements in battery technologies. Future
trends in battery technologies alone (along with the con-
tinual pressure for further device miniaturization) do not
promise dramatic improvements that will make this is-
sue disappear.


Newer hardware improvements are reducing the power 
consumption of system components such as back-lit dis- 
plays, CPUs etc. However, WNICs operating at the same 
frequency band and range continue to consume signifi- 
cant power. Earlier work by Stemm et al. [32] reported 
that the network interfaces draw significant amounts of 
power. For example, a 2.4 GHz Wavelan DSSS card (11 
Mbps) consumes 177 mW while in sleep state, but con-
sumes 1319 mW while idle. Havinga et al. [18] noted
that this Wavelan card consumes 1425 mW for receiving 
data and 1675 mW for transmitting data. For compar- 
ison, a fully operational Compaq iPAQ PDA only con- 
sumes 929 mW while the same iPAQ consumes 470 mW 
with the backlight turned off [14, 8]. For reference, the 
iPAQ is equipped with two 2850 mWh batteries, one
each in the unit and the PCMCIA sleeve. Streaming me- 
dia tends to be large and long running and consume sig- 
nificant amounts of network resources to download data. 
Hence, it is important to look at techniques to reduce the 
energy consumed by the network interface to download 
the multimedia stream. 
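
To give a feel for what these figures imply, the following back-of-the-envelope sketch estimates battery life with the WNIC idle versus asleep. It rests on several assumptions that the text does not state: the 929 mW iPAQ figure is taken to exclude the WNIC, power draw is treated as constant, and the batteries are treated as ideal. The constant and function names are illustrative.

```python
BATTERY_MWH = 2 * 2850   # two 2850 mWh batteries (unit and PCMCIA sleeve)
PDA_MW = 929             # fully operational iPAQ, assumed to exclude the WNIC
WNIC_IDLE_MW = 1319      # Wavelan card, idle
WNIC_SLEEP_MW = 177      # Wavelan card, sleep state


def runtime_hours(wnic_mw, pda_mw=PDA_MW, capacity_mwh=BATTERY_MWH):
    """Hours of operation at a constant total power draw."""
    return capacity_mwh / (pda_mw + wnic_mw)


idle_hours = runtime_hours(WNIC_IDLE_MW)    # about 2.5 hours
sleep_hours = runtime_hours(WNIC_SLEEP_MW)  # about 5.2 hours
```

Under these assumptions, a WNIC that could stay asleep would roughly double the runtime, which is why the time spent outside the sleep state matters more than the amount of data transferred.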


Traditionally, reducing the fidelity of the stream and 
hence the size is a popular technique that is used to 
customize the multimedia stream for a low bandwidth 
network. Reducing multimedia fidelity can also be ex- 
pected to reduce the amount of data and hence the total 
energy consumed. However, if care is not taken to re- 
turn the network interface to the sleep state as much as 
possible, reducing the amount of transmitted data will 
have negligible effect on the overall client energy con-
sumption. Frequent switching to low power consump-
tion states also promises the added benefit of allowing
the batteries to recover, exploiting the battery recovery 
effect [7]. 


In our earlier work [5], we explored the client WNIC 
energy implications of popular streaming formats (Mi- 
crosoft media [28], Real media [29] and Quicktime [2]) 
under varying network conditions. We believe that these 
widely popular formats are more likely to be deployed 
than custom streaming formats (that are specially opti- 
mized for lower energy consumption). Based on our ob- 
servations, we developed history based client-side tech- 
niques to exploit the stream behavior and lower the en- 
ergy required to receive these streams. We illustrated 
the limitations of such client-side policies in predicting 
the next packet arrival times. These client-only policies 
do not allow us to achieve the potential energy saving 
for consuming multimedia streams without losing data 
packets. 


In general, mechanisms that make the data packets arrive 
at predictable intervals can facilitate such transitions to 
lower power states. The choice of these transmission pe- 
riods is a trade-off between frequent transitions to high 
power states and added delays in the multimedia stream 
reception. Such traffic shaping can be realized in
the origin server, in the network infrastructure closer to
the mobile client, or in the access point itself at the
MAC level. In this work, we analyze the effectiveness
of these different approaches in regulating the streams 
to transmit data packets at regular, predictable intervals. 
Such packet arrivals enable client-side mechanisms to 
effectively transition the wireless interfaces to a lower 
power consuming sleep state. 


In this work, we show the limitations of MAC level IEEE
802.11 power saving mode for isochronous multimedia 
streams. The access points have to balance potential en- 
ergy savings for a single mobile client with a need for 
fair allocation of network resources. Without an under- 
standing of the stream requirements, these MAC level 
mechanisms do not offer any energy savings for multi-
media streams over 56 kbps. The potential energy sav-
ings are also reduced for multiple clients sharing the same
access point. On the other hand, a server side traffic 
shaping mechanism can offer good energy saving for 
all the stream formats without any data loss. We show 
that the mechanism can save up to 83% of the energy 
required for receiving useful data. The technique can 
also offer similar savings for multiple clients sharing the 
same wireless access point. For high fidelity streams, 
typical media systems react to these added delays by 
lowering the stream quality. We propose that future me-
dia players offer configurable settings for clients operat-
ing under energy conserving WLAN systems such that
these delays are not associated with network congestion. 


The remainder of this paper is organized as follows: Sec- 
tion 2 reviews our previous work as the necessary back- 
ground and places our work in context to other related 
work. Next we present the experimental setup, evalua- 
tion methodologies, measurement metrics and the work- 
loads used in our study in Section 3. Section 4 ana-
lyzes the effectiveness of the IEEE 802.11 power manage-
ment scheme to conserve energy for our streams. Sec-
tion 5 explores the effectiveness of server side assistance
in conserving the WNIC energy on the client. We con- 
clude in Section 6. 


2 Related work 


There has been considerable work on power manage- 
ment for components of a mobile device. This work 
includes spindown policies for disks and alternatives 
[35, 3, 26, 12, 19], scheduling policies for reducing
CPU energy consumption [34, 17] and managing wire-
less communications [20, 11, 21, 30]. Our work is sim-
ilar in spirit to the work of Feeney et al. [15]. They ob-
tain detailed measurements of the energy consumption
of an IEEE 802.11 wireless network interface operat- 
ing in an ad hoc networking environment. They showed 
that the energy consumption of an IEEE 802.11 wire- 
less interface has a complex range of behavior and that 
the energy consumption was not synonymous with band- 
width utilization. Our work explores similar techniques 
for WNIC energy management for multimedia traffic. 


Lorch et al. [27] presented a survey of the various soft- 
ware techniques for energy management. Havinga et 
al. [18] presented an overview of techniques for energy
management of multimedia streams. Agrawal et al. [1] 
described techniques for processing video data for trans- 
mission under low battery power conditions. Corner et 
al. [10] described the time scales of adaptation for mo- 
bile wireless video-conferencing systems. 


Ellis [13] advocates high level mechanisms for power 
management. Flinn et al. [16] demonstrated such a col- 
laborative relationship between the operating system and 
application to meet user-specified goals for battery du- 
ration. Vahdat et al. [33] proposed that energy as a 
resource should be managed by the operating system. 
Kravets et al. [22] advocated an end-to-end model for 
conserving energy for wireless communications. In our 
earlier work [6], we utilized transcoding as an applica-
tion level technique to reduce the image data; trading off
image size for network transmission and storage costs. 
In this work, we explore high level mechanisms to en- 
able the mobile client to transition to lower power states 
and reduce overall energy requirements. 


2.1 Client-side history based adaptation mech- 
anism 


In our earlier work [5], we explored the energy impli- 
cations of popular streaming formats (Microsoft media 
[28], Real media [29] and Quicktime [2]) under varying 
network conditions. We believe that these widely popu- 
lar formats are more likely to be deployed than custom 
streaming formats (that are specially optimized for lower 
energy consumption). We showed that Microsoft media 
tended to transmit packets at regular intervals. For high 
bandwidth streams, Microsoft media exploits network 
level fragmentation, which can lead to excessive packet 
loss (and wasted energy) in a lossy network. Real stream 
packets tend to be sent closer to each other, especially 
at higher bandwidths. Quicktime packets sometimes ar- 
rive in quick succession; most likely an application level 
fragmentation mechanism. 


Based on our observations, we developed history based 
client-side techniques to exploit the stream behavior and 
lower the energy required to receive these streams. We 
illustrated the limitations of such client-side policies in 
predicting the next packet arrival times. We showed that 
the regularity of Microsoft media packet arrival rates al-
lows simple, history based client-side policies to transi-
tion to lower power states with minimal data loss. A Mi-
crosoft media stream optimized for 28.8 Kbps can save
over 80% in energy consumption with 2% data loss. A 
high bandwidth stream (768 Kbps) can still save 57% 
in energy consumption with less than 0.3% data loss. 
For comparison, the WNIC was only receiving data for 
0.45% and 14.51% of the time for these two streams, re- 
spectively. Also, both Real and Quicktime packets were 
harder to predict at the client-side without understanding 
the semantics of the packets themselves. Quicktime’s 
energy savings came at a high data loss, while Real of-
fered negligible energy savings. We believe that modify-
ing Real and Quicktime services to transmit larger data
packets at regular intervals can offer better energy con- 
sumption characteristics with minimal latency and jitter. 


2.2 MAC level power saving modes 


Wireless technologies such as IEEE 802.11 [25, 24] and
Bluetooth [4] use a scheduled rendezvous mechanism of
power saving wherein the wireless nodes switch to a low
power sleep mode and periodically awaken to receive 
data from other nodes. Different wireless technologies 
utilize variations of this scheduled rendezvous mecha- 
nism. For example, IEEE 802.11 uses scheduled bea-
cons along with a TIM/DTIM packet notification mech-
anism. Bluetooth, a wireless cable replacement tech-
nology, defines three different power saving modes: the
sniff mode, which defines a variable slave-specific activ-
ity delay interval; the hold mode, wherein the slave can
sleep for a predetermined interval without participating
in the data traffic; and the park mode, wherein the slave
gives up its active member address. The potential en-
ergy saving progressively decreases from sniff to hold
to park modes. In general, Bluetooth enabled devices 
with their range less than 10 meters consume less power 
than IEEE 802.11 based WLAN devices. For example, a 
typical Bluetooth device (Ericsson PBA 313 01/2) con-
sumes 84 mW to receive data, 126 mW to transmit data
while consuming as little as 2 mW in the park state. En- 
ergy consumption in a fully packaged system is likely 
to be higher. Bluetooth technology offers a vertically 
integrated solution with a lower data rate and range as 
compared to wireless LAN technologies. As such, Blue- 
tooth technologies are tuned towards cable replacement 
rather than for isochronous traffic generated by multime- 
dia traffic. 


3 System Architecture 


In the last section we outlined earlier work on mecha- 
nisms for energy efficient multimedia streaming as well 
as the limitations of client-only policies in conserving 
energy without losing data packets. In this section, we 
describe the objectives, the system architecture and the 
experimental setup. We will highlight our experiences 
in the next two sections. 


3.1 Objectives 


Our primary goal was to reduce the energy required 
by the wireless client to consume a certain multimedia 
stream (of a given quality). We explore the effectiveness 
of energy aware traffic shaping in the network infras-
tructure closer to the mobile client and in the wireless
access point itself. The mechanism should also allow
energy savings for multiple wireless clients sharing the
same wireless access point.


Figure 1: System Architecture


Our experiments were designed to answer the following 
questions: 


• Can we realize energy savings at the network MAC
level without an understanding of the application
level stream dynamics?


• Can traffic shaping assistance from the network in-
frastructure allow the client to transition to lower
power consuming states more effectively?


3.2 Architecture 


The system architecture is illustrated in Figure 1. The
system consists of a server side proxy (SSP) or local 
proxy (LP) and a client-side proxy (CSP). The server 
side proxy allows the flexibility of traffic shaping at 
the source, without actually modifying the multimedia 
servers themselves. The local proxy performs similar 
functionality to a server side proxy and shares the same 
LAN network with the wireless client. The server side 
and local proxies can inform the client-side proxy of the 
next scheduled data burst. The client-side proxy inter- 
acts with the server side or local proxies. It is the respon- 
sibility of the client-side proxy to transition the WNIC to 
a lower power sleep state between scheduled data trans- 
fers. Since no data transfers are expected during this 
sleep interval, no data is lost. 


3.3 Experiment Setup 


The system setup that was used to customize the network 
transmissions to conserve energy for popular streaming 
formats is illustrated in Figure 2. The various compo- 
nents of our system are: 


• Multimedia Server: Our multimedia server (Dell
330) was equipped with a 1.5 GHz Pentium 4 with
512 MB of PC800 RDRAM memory, running Mi-
crosoft Windows 2000 Server SP2. The server was
running Windows Media Service, Realserver 8.01 
and Apple Darwin Server 3.0.1. 


• Wireless Access Point: For our experiments, we
used dedicated D-Link DWL 1000, Orinoco RG
1000 and Orinoco AP 500 access points. The 
AP 500 was connected to an external range exten- 
der antenna. Throughout our experiments, we had 
turned off the security encryption feature of the ac- 
cess points. 


• Local proxy: We used a Dell dual processor (Pen-
tium Xeon 933 MHz) server with 1.5 GB of mem-
ory, running FreeBSD 4.4 (STABLE) for the local 
proxy. The proxy buffered packets from the multi- 
media servers and transmitted them after the con- 
figured delay. The proxy added these delays after 
transmitting all pending packets. 


• Browser Stations: We used two 500 MHz Pentium
III laptops with 256 MB RAM, running
Windows 98, as our browsing stations. Wireless
connectivity was provided by 11 Mbps Orinoco
PCMCIA WLAN cards. The laptops accessed the
streaming formats using Microsoft Media, Real and
Quicktime players.


• Monitoring Station: The packets transmitted from
the server to the browser station were passively captured
by the monitoring station, which was physically
kept close to the browser station and the wireless
access point. We used an IBM T21 laptop with an
800 MHz Pentium III processor and 256 MB RAM,
running Redhat Linux 7.2. Packets were captured
using tcpdump 3.6. We assume that the packets arrive
at tcpdump and at the browser applications at
similar times.


Figure 3: Policies to shape the network traffic to arrive
at predictable intervals


The tcpdump packet traces captured by the Monitoring
Station were fed to our client-side proxy simulator to
analyze the system energy performance without perturbing
the Browser stations. We utilized published power
parameters [32, 18] for a 2.4 GHz DSSS IEEE 802.11b
Orinoco card for our simulations. The model does not
simulate lower level energy costs such as unsuccessful
attempts to acquire the channel (media contention) or
messages lost due to collision, bit error or loss of wireless
connectivity. Further, our simulations do not leverage
the battery recovery effect. We assume a linear energy
model for wireless NIC power consumption.
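

As an illustration of what a linear energy model means here, the
following minimal sketch multiplies the time spent in each WNIC state
by a fixed power draw and sums the results. The power values are the
ones quoted for the Lucent card in Section 4; the function name and
the example residency times are hypothetical and are not taken from
the simulator used in this paper.

```python
# Power draw per WNIC state in watts (177 mW sleep, 1319 mW idle,
# 1675 mW read, as quoted in Section 4 of the paper).
POWER_W = {"sleep": 0.177, "idle": 1.319, "read": 1.675}


def wnic_energy_joules(seconds_in_state):
    """Linear model: energy is the sum of (state power * time in state)."""
    return sum(POWER_W[state] * secs for state, secs in seconds_in_state.items())


# Hypothetical residency for a 119 s stream (assumed numbers, for illustration):
# 10 s receiving, 20 s idle, 89 s asleep.
example = wnic_energy_joules({"read": 10.0, "idle": 20.0, "sleep": 89.0})
print(f"{example:.3f} J")
```

The model deliberately ignores state transition costs and channel
contention, in line with the simplifications listed above.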


3.3.1 Multimedia Stream Collection 


For our experiments, we used the Wall (movie) theatrical
trailer. We replayed the trailer from a DVD player
and captured the stream using the Dazzle Hollywood DV
Bridge. We used Adobe Premiere 6.0 to convert the captured
DV stream to the various streaming formats. The
trailer was 1:59 minutes long. The Wall trailer was digitized
to a high quality stream and hence allowed us the
flexibility of creating streams of varying formats and
fidelities. Hence, we use this stream for the rest of this
paper.


3.4 Traffic shaping policies in the network infrastructure


The various states of packet transmission for a traffic
shaping network proxy are illustrated in Figure 3.
The proxy buffers network packets from different flows
(clients) and transmits them at client-specific intervals.
The server maintains separate delay intervals per individual
client (e.g. delay1 and delay2) and transmits
packets to avoid contention on the wireless network.
Such traffic shaping allows multiple clients to operate
without interfering with each other's sleep intervals.


Figure 4: IEEE 802.11 Power Saving Mode (simplified)


3.5 Performance metrics


For our experiments, we use the following performance 
metrics to measure the efficacy of our approach: 


• Energy consumed: The goal of these experiments
is to reduce the energy consumed by the WNIC.


• Energy metric: Depending on the delay introduced
by our traffic shaping policies, the multimedia
players automatically (and incorrectly, since the
effective bandwidth available was still the same)
adapted to the delays by lowering the stream fidelity.
In order to compare the energy consumption
in such a scenario (where the amount of data
received can be different), we introduce the notion
of energy metric, defined as the amount of energy
required to download multimedia data (denoted in
Joules/KB). It is preferable to decrease the energy
metric. In general, even though low fidelity streams
consume less overall energy, the energy metric is
high as the WNICs spend most of the time in
wasted idle or sleep states (instead of actually receiving
useful data).
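

A minimal sketch of the energy metric follows. It assumes that 1 KB
means 1024 bytes, which the text does not specify, and the input
numbers are invented purely to show why a low fidelity stream can
consume less total energy and still score worse on the metric.

```python
def energy_metric(energy_joules, bytes_received):
    """Energy metric in Joules/KB (1 KB assumed to be 1024 bytes)."""
    if bytes_received <= 0:
        raise ValueError("no data received; the metric is undefined")
    return energy_joules / (bytes_received / 1024.0)


# Hypothetical streams (assumed values, not measurements from the paper):
# a high fidelity stream uses more energy but moves far more data.
high_fidelity = energy_metric(90.0, 6000 * 1024)
low_fidelity = energy_metric(40.0, 800 * 1024)
print(high_fidelity, low_fidelity)
```

With these assumed inputs the low fidelity stream has the larger
Joules/KB value even though it consumes less energy overall.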


4 Implications of IEEE 802.11 Power 
Management in Wireless Access Points 


First we explore the effectiveness of the IEEE 802.11 MAC
level power saving mode for reducing client WNIC
energy consumption. In the next section, we investigate
a proxy architecture to effect energy savings.


The IEEE 802.11 wireless LAN access standard [23]
defines a power saving mode of operation wherein the
wireless station and the access point cooperate to conserve
energy (illustrated in Figure 4). The wireless station
informs the access point of its intention to switch
to power saving mode. Once in power saving mode,
the wireless station switches the WNIC to a lower
power sleep state, periodically waking up to receive beacons
from the access point. The access point buffers any
packets for this station and indicates a pending packet
using a traffic indication map (TIM). These TIMs are
included within beacons that are periodically transmitted
from the access point. On receipt of an indication
of a waiting packet at the access point, a wireless client
sends a PS poll frame to the access point and waits for a
response in the active (higher energy consuming) state.
The access point responds to the poll by transmitting
the pending packet or an indication for future transmission.
The access point indicates the availability of multiple
buffered packets using the More data field. For multicast
and broadcast packets, the access point transmits
an indication for pending packets using a delivery TIM
(DTIM) beacon frame, immediately followed by the actual
multicast or broadcast packet (without an explicit
PS poll from clients). DTIM intervals are usually configurable
at the access point and can be any multiple of the
beacon interval. The standard does not define the buffer
management or aging policies in the access points. Note
that the standard does not define a power saving transmit
mode for the wireless station itself; it can transmit a
packet whenever it wants (regardless of the beacon).


Figure 5: Energy consumed for varying Wait times


At first glance, it would appear that the 802.11 power
saving mode can indeed allow the wireless clients to
conserve energy by allowing them to frequently transition
to a lower power consuming sleep state. However,
the potential energy saving depends on a minimal Wait
interval (the time between the transmission of the PS poll
message and receipt of the data packet; illustrated in
Figure 4). The IEEE 802.11 standard does not specify any
time bound for this Wait interval. For a general purpose
wireless access point, reducing the Wait interval has the
inadvertent side effect of increasing the priority of data
packets for wireless stations operating in the power saving
mode. Access point manufacturers typically associate
power saving mode with a lower throughput state.
They also discourage high rate multicast traffic for the
same reason.


In order to understand the implications of this wait in- 
terval, we developed a simple simulator that modeled 
the various power saving states of a popular WNIC. We 
modeled a 2.4 GHz Lucent wireless card that consumes 
177 mW, 1319 mW and 1675 mW while in sleep, idle 
and read states, respectively. These energy parameters 
were published in [32, 18]. We chose a beacon interval
of 100 ms (used by Orinoco access points) and varied the 
average Wait times for reasonable values of 0 through 
1000 msec. The access points are modeled with infinite 
buffer space; as the wait times increase, more packets 
are buffered and are sent back-to-back in succession. A 
realistic access point would drop these packets once the 
buffers fill up. 
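

To make the role of the Wait interval concrete, the sketch below
estimates WNIC energy as a function of the average Wait time. It is
not the simulator used in this paper: it assumes every packet is
polled individually, ignores the cost of waking up for beacons, and
uses a hypothetical packet count and per-packet read time. Only the
three power values come from the text.

```python
# Power draw per state in watts (values quoted in the text above).
POWER_W = {"sleep": 0.177, "idle": 1.319, "read": 1.675}


def psm_energy_joules(duration_s, n_packets, read_ms, wait_ms):
    """Energy for one stream when each packet costs wait_ms of idle waiting."""
    read_s = n_packets * read_ms / 1000.0
    # Idle waiting after each PS poll; it cannot exceed the time left
    # in the stream once reading is accounted for.
    idle_s = min(n_packets * wait_ms / 1000.0, max(duration_s - read_s, 0.0))
    sleep_s = max(duration_s - read_s - idle_s, 0.0)
    return (POWER_W["read"] * read_s
            + POWER_W["idle"] * idle_s
            + POWER_W["sleep"] * sleep_s)


# Hypothetical 119 s stream of 1000 packets, 5 ms of read time each.
sweep = {w: psm_energy_joules(119, 1000, 5, w) for w in (0, 10, 50, 200)}
for wait_ms, joules in sweep.items():
    print(wait_ms, round(joules, 3))
```

In this simplified model the energy grows linearly with the Wait time
until the card never sleeps at all, after which longer waits make no
further difference.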


We plot the results for varying the average Wait interval
for some of the streams measured in our earlier study
([5]) in Figure 5. For comparison, in [5], we noted that
a client-side technique consumes 135 (0.15% data loss),
132 (8% data loss) and 47 (23% data loss) Joules for
these MS Media, Apple Quicktime and Real streams,
respectively. To receive these streams, the WNICs needed
to be in active read state for 43.63%, 11.30% and 5.11%
of the time, which corresponds to a necessary energy
consumption of 90.24, 42.94 and 33.57 Joules, respectively.
From Figure 5, we note that the potential energy
saving is heavily dependent on the average Wait intervals.
Wait times of zero illustrate the necessary energy
consumption values. For average wait times less than
100 msec, even a small increase in the average Wait can
drastically affect the energy saving. However, larger wait
times do not offer much energy savings.


Hence, we tried to measure the actual Wait times for typical
access points. We carefully disassembled the plastic
shielding around an Orinoco Silver PC card and hooked
two voltage probes around the two status LEDs. The
first LED showed the card power state (transitioning to a
high voltage stage while the card is active) while a second
LED showed when a data packet was transmitted
or received. We connected the PC card to an external
Orinoco range extender antenna to reduce the effects of
our instrumentation on the proper operation of the wireless
card. We used this instrumented PC WNIC card on
laptops running MS Win 2000, Win 98, Redhat Linux
7.2 and Compaq iPAQ; configured the PC card to operate
in power saving mode and watched media streams using
MS Media, Real and Apple Quicktime players for the
Windows machines, Real for Linux and PocketTV and
MS Media players for the iPAQ. For our study, we used
Orinoco RG 1000, Orinoco AP 500 and D-Link DWL
1000 access points. All the access points were set up on
a dedicated LAN segment (with no background network
traffic) and operated on the same wireless channel.


We noticed that Orinoco RG 1000 and AP 500 access
points used a beacon interval of 100 msec (in spite of
our attempts at changing this interval; the Linux drivers
provide an API to request a different beacon interval).
The D-Link DWL 1000 allowed us to choose an interval
of either 160 msec or 80 msec; the Windows driver
did not allow us to modify this interval and chose 80
msec for TIM. We plot several representative results
for the card status during our experiment in Figure 6.
Note that the different plots show different parts of the
video stream. We show the results for several low bitrate
streams. These lower quality streams transmit less data
and can be expected to offer considerable energy savings.
In each graph, the plot for Channel 2 (top) shows
the duration that the card stays in active state. Channel 1
(bottom) shows the duration when an actual packet is either
transmitted or received at the card. Ideally, we want
the top plot to be active for the least amount of time,
appearing similar to the bottom plot.


Figure 6(a) plots the results for viewing an MS stream
at 56 kbps using the iPAQ device. From Figure 6(a),
we note that the power save mode sometimes works optimally.
The first two data packets were received with
the card in the higher power state for the least amount of
time. The third packet, however, triggers the card to stay
in the higher energy consuming state for a long time; the
access point does not transmit the next packet until the
next beacon interval (a Wait interval of 100 msec).
Figure 6(b) plots the results for viewing an Apple Quicktime
stream at 30 kbps from a laptop. In Figure 6(b),
we notice similar periods of active waiting. In
Figure 6(c), we notice longer wait intervals for watching a
Real stream at 56 kbps. As noted in [5], Real tends to
send many smaller packets (as compared to Microsoft
media). Such packets tend to leave the WNIC in the higher
energy consuming active state while the access point
tries to operate “fairly” for a general audience. In fact,
we noticed that watching any stream over 56 kbps tends
to leave the WNIC completely in the higher energy consuming
active state (even though it must be possible to keep
the Wait times lower, albeit consuming most of the available
bandwidth).


(c) Windows 2000, Real stream at 56 kbps

Figure 6: Status of IEEE 802.11b wireless PC card in
power save mode (the top plot (Ch 2) shows the WNIC
power state and the bottom plot (Ch 1) shows data
transmission/reception intervals. The x-axis shows the time
in 100 msec ticks with voltage along the y-axis)


We also experimented with the power implications of re- 
ceiving multimedia streams using multicast packets. The 
Orinoco drivers allow us to configure the DTIM interval 
(higher values would reduce the multicast throughput). 
We observed that even when the DTIM interval was the 
same as TIM interval, the access points tended to drop 
the multicast packets. In particular, the D-Link DWL 
1000 access points dropped most of the multicast pack- 
ets. Hence multicast was not a reliable streaming mech- 
anism for our purposes. 


In general, we believe that the 802.11b power saving 
mode has limitations for saving client WNIC energy 
consumption for the following reasons: 


1. Access point behavior hardware dependent: 
The potential energy saving depends on the pol- 
icy choices at both the access point and the mobile 
client station. A general access point needs to bal- 
ance the need for power saving on a single client 
with fairly sharing the available bandwidth. Even 
with no other clients to share the bandwidth, the 
access points tested still tended to keep the clients 
waiting for extended periods of time. 


2. Does not co-exist for multiple clients: For mul- 
tiple clients accessing multimedia streams simulta- 
neously, the TIM specifies all the clients with pend- 
ing network packets. The standard does not define 
the order in which client requests are actually ser- 
viced. Clients could wait for the whole time needed 
to transmit packets for all the clients, even though 
the PS poll requests were all sent at the same time. 


The mobile station can delay the transmission of 
the initial PS poll message to allow the access point 
to transmit data packets for other nodes. However, 
we are not aware of any such protocol defined by 
the standard to control the client PS poll interval. 


3. Uses fixed TIM interval for all clients: Even
though the standard does not explicitly prevent the
access point from dynamically varying the beacon
interval to accommodate the observed traffic levels,
none of the access points that we tested changed
the beacon interval. The choice of the beacon interval
is a trade-off between the average packet delay
at the access point and the periodicity of client
wakeup intervals. As the data rate increases, reducing
the TIM interval can reduce the packet delay
at the access point. With multiple clients in power
saving mode of operation, there is a need for
per-client TIM intervals.


4. Power save operation not application
aware: With the notification (TIM) ->
request (PS poll) -> data transmission model
of operation, the IEEE 802.11b standard assumes
that the data traffic is sporadic and well behaved
(few packets spread evenly). On the other hand,
multimedia streams tend to be isochronous. Formats
such as Real tend to transmit smaller packets
at close intervals. Formats such as MS media
tend to utilize network fragmentation, leading
to fragmented packets sent closer to each other.
Delaying parts of a fragment can delay the delivery
of the entire packet to the multimedia browser.


4.1 Whitecap technology and IEEE 802.11e
standard


The upcoming IEEE 802.11e will incorporate the Whitecap
[9] technology to provide QoS guarantees for multimedia.
The technology provides contention free access
for multimedia traffic by reserving portions of the
transmission spectrum for periodic multimedia traffic.
It seems possible for clients to operate in power save
mode, transitioning to an active state to receive multimedia
streams. The standard is still in the draft stages.


5 Implications of energy aware multime- 
dia service 


In the last section, we discussed the limitations of utilizing
the 802.11b MAC power management scheme for
popular streaming media formats. In this section, we 
explore the implications of modifying the origin server 
(through a proxy) to customize the streams so that the 
clients can frequently transition the WNICs to lower en- 
ergy consuming states. 


Table 1: Energy consumed and % packets dropped by a client-side history based approach (detailed discussion in [5])

[Table body not recoverable from the source. Surviving column headings: Stream format; Stream b/w; Energy (in Joules); Client-side adaptation (in Joules); Bytes dropped (%). Surviving entries for Microsoft Media (columns unknown): 149.7, 150.1, 149.2.]


In general, any traffic shaping at the origin server will be
affected by the loss and delay characteristics of the wide-area
Internet. Traditionally, buffering at the client has
been used to offset these delays. However, the clients
need to know the exact time of arrival for the first packet
in order to minimize data loss. The loss of the last packet
in a stream (which indicates the arrival of the next packet
stream) affects the potential energy saving. Also, multiple
clients using the same access point but receiving
different streams would experience delays because of
competing data reception characteristics. In summary,
modifying the origin servers to shape the network traffic
so that clients can frequently transition to lower power
states has the following drawbacks:


• Loss of the control packet in a stream can adversely
affect the energy savings. Throughout this work,
we assumed that the wireless network does not suffer
from noticeable multimedia data packet loss. In
general, if the control packet specifying the client
sleep interval from the local proxy is lost, then the
client will wait in a higher power consuming idle
state. Losing the data packets themselves would trigger
higher level mechanisms that adapt the stream to a
lower fidelity stream.


• Multiple local clients receiving streams from different
servers can lead to conflicting schedules on the
network.


• Any packet delay in the wide area can leave a client
in a higher power consuming state.


We implemented our policies on a local proxy based
architecture. The local proxy buffers the multimedia packets
and periodically transmits them to the client (similar
in spirit to the IEEE 802.11 beacons). The local proxy
could also dynamically inform the client-side of the next
packet arrival time using a special control packet. The
client-side proxy uses the interval between transmissions
to transition the client WNIC to a lower energy consuming
sleep state. The client-side proxy informs the local
proxy server of the access point that it is associated with.
The local proxy schedules the transmissions in such a
way as to avoid conflicts with other clients using the same
wireless access point.
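

The buffering behaviour described above can be sketched roughly as
follows. This is a simplification that releases buffered packets on
fixed interval boundaries, whereas our proxy adds the configured delay
after transmitting all pending packets; the function name and the
arrival times are hypothetical.

```python
from collections import defaultdict


def batch_into_bursts(arrival_ms, interval_ms):
    """Group packet arrival times (ms) by the interval boundary that releases them."""
    bursts = defaultdict(list)
    for t in arrival_ms:
        # Round up to the next multiple of interval_ms (integer ceiling).
        release = -(-t // interval_ms) * interval_ms
        bursts[release].append(t)
    return dict(sorted(bursts.items()))


# Hypothetical packet arrivals at the proxy, in milliseconds.
arrivals = [3, 41, 97, 100, 180, 260, 399]
bursts = batch_into_bursts(arrivals, 100)

# Worst-case extra delay that buffering adds to any one packet.
max_added_delay = max(rel - t for rel, pkts in bursts.items() for t in pkts)
print(bursts, max_added_delay)
```

Between release times the client expects no data, which is the window
the client-side proxy can use to put the WNIC to sleep.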


First we analyze the energy savings for the various popular
streaming formats (MS Media, Real and Apple
Quicktime) using streams customized for different network
bandwidth requirements. We also explore the implications
of multiple clients sharing the same wireless
access point. For reference, the energy savings and the
associated data loss for the various streaming formats using
a client based adaptation strategy (that was discussed
in [5]) are tabulated in Table 1.


Figure 7: Energy metric for MS media streams


5.1 Single wireless client associated with the 
wireless access point 


First we perform experiments to explore the energy sav- 
ing possible with a cooperating proxy that enables the 
clients to transition to lower energy consuming sleep 
state. We explore the implications for popular streaming 
formats (Microsoft media, Real, Apple Quicktime). We 
configured the local proxy to transmit buffered packets 
after a delay of 50, 100 and 200 msec. 


In some cases (e.g. high fidelity streams), depending on
the delay introduced, the multimedia players automatically
(and incorrectly) adapted to the delays by lowering
the stream fidelity. In order to compare the energy
consumption in such a scenario, we used the notion of
energy metric, defined as the amount of energy required
to download multimedia data (denoted in Joules/KB). It
is preferable to reduce the energy metric.


Figure 9: Energy metric for Quicktime streams


5.1.1 Microsoft media streaming format 


We plot the energy metric for video streams customized
for 56 kbps, 128 kbps, 256 kbps, 768 kbps and 2000
kbps bandwidth streams in Figure 7. From Figure 7,
we note that a proxy that transmits packets every 200
msec can offer energy savings for low fidelity
streams (without any data loss associated with a
client-only scheme [5]). For a 56 kbps stream, a server that
transmits packets every 200 msec can reduce the energy
consumption by as much as 83%. However, for high
bandwidth streams (low bandwidth streams are themselves
transmitted infrequently) and increased delays,
the media player adapts to increasing delays by reducing
the stream fidelity. This results in reduced energy
consumption while increasing the energy metric. In fact,
the energy metric can sometimes be worse than (e.g. 768
kbps stream) receiving the original unmodified streams.
Such adaptation can be counteracted by buffering the
network packets in the client-side proxy and locally delivering
them to the multimedia player at a more regular
pace. Note that such additional buffering introduces
its own energy requirements for maintaining the local
buffers.


Also, we used the nanosleep() system call in the local
proxy to delay packets for delivery. General purpose
operating systems such as FreeBSD and Linux schedule
sleeping jobs at 10 msec intervals and hence the actual
sleep intervals can be at least 10 msec more than the
programmed value. We noticed that this extra interval
tends to be more than 10 msec for processes that sleep
for longer intervals of time. The client-side proxy actively
waits for packets in this extra 10 msec in a higher
energy consuming idle state, leading to increased energy
consumption and higher energy metric. Real time schedulers
for Linux can reduce this scheduling interval to 2
msec. We are currently investigating such schedulers to
further reduce the energy consumption.
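

As a back-of-the-envelope illustration of this effect, the sketch
below estimates the extra idle-state energy when every burst arrives
late by a fixed overshoot. It assumes the 1:59 (119 s) trailer, one
burst per proxy interval, and a constant overshoot per burst; the
function name is hypothetical and the result is an estimate, not a
measurement.

```python
IDLE_POWER_W = 1.319  # idle-state power of the modeled card, in watts


def overshoot_energy_joules(stream_s, interval_ms, overshoot_ms):
    """Extra energy spent idling while each burst arrives overshoot_ms late."""
    n_bursts = (stream_s * 1000) // interval_ms
    return n_bursts * (overshoot_ms / 1000.0) * IDLE_POWER_W


# 200 msec proxy interval; 10 msec versus 2 msec scheduling granularity.
tick10 = overshoot_energy_joules(119, 200, 10)
tick2 = overshoot_energy_joules(119, 200, 2)
print(round(tick10, 3), round(tick2, 3))
```

Under these assumptions the cost scales linearly with the overshoot,
so a 2 msec scheduling interval would cut this overhead to one fifth.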


5.1.2 Real streaming format


We repeat the experiments from the last section for Real
streams transmitted at 56 kbps, 128 kbps, 256 kbps and
512 kbps and plot the corresponding energy metric in
Figure 8. We configured the Real player to not transmit
the stream reception quality feedback to the origin
servers and hence the system did not adapt to the packet
reception delays. We noted that the server assisted policies
offer substantially better energy saving than simple
client-side only policies, without any associated data
loss. For example, a 56 kbps stream with a local proxy
that transmits packets every 100 msec only consumed 28
Joules (as compared to 116 Joules and 4% data loss for
client-side history based mechanisms). Also, recall that
Real typically streams the video streams quicker than the
other formats. Introducing high delays (e.g. 200 msec)
prolonged the stream transmission, adding extra sleep
cycles and slightly increasing the energy metric.


5.1.3 Quicktime streaming format 


In the last two sections, we explored the potential energy
saving in a server that transmits the stream packets
in predictable intervals for Microsoft media and Real
streams. In this section, we repeat the experiments for
Apple Quicktime streams at 56 kbps, 128 kbps and 256
kbps and plot the results in Figure 9. From Figure 9,
we notice the potential energy savings without the associated
high data loss rates (illustrated in Table 1). Note
that Quicktime transmits the audio and video portions of
the stream using separate UDP stream channels. The local
proxy needs to be aware of these independent streams
in order to schedule the data transmissions.


Table 2: Energy consumed and % packets dropped by
the client-side history based approach for 2 simultaneous
clients in the same wireless access point

[Table body not recoverable from the source. Column headings: Stream Format; Stream b/w; Energy (in Joules); Bytes dropped (%). Surviving row labels: Microsoft Media at 56, 128, 256, 768 and 2000 Kbps; a second group at 56, 128 and 256 Kbps. Surviving entries (rows and columns unknown): 27.17, 27.96, 32.94; 29.71, 49.22, 41.49.]


5.2 Implications of multiple clients using the 
same wireless access point 


In the last section we explored the potential energy savings
for a simple server policy that transmits packets at
predictable intervals (similar to the IEEE 802.11 MAC
level power saving mode). We performed our experiments
in a dedicated WLAN environment with a single
wireless client. We showed the potential energy savings
over client-side policies. We discussed the effects of
operating system scheduling policies that can reduce the
potential energy saving. We also identified media player
adaptation to increased packet delay and its effects on
the energy metric.


However, in a typical operating scenario, there could be
many mobile clients within a single access point sharing
the same physical medium. The local proxy, with its
knowledge of the physical WLAN limitation, can schedule
the clients so as to minimize data contention on the
network. In this section, we explore the implications of
two clients using the same wireless access point, consuming
the same video stream (slightly offset in time so
that the same data is not simply multicast once).


Figure 11: Energy metric for two simultaneous Real
streams
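

One simple way a proxy could keep two clients from contending for the
medium is to give each client the same transmission interval with a
different phase offset. The sketch below illustrates that idea only;
it is an assumption for illustration, not the scheduling policy
implemented in our proxy, and all names in it are hypothetical.

```python
def stagger_offsets(client_ids, interval_ms):
    """Give each client an evenly spaced phase offset within one interval."""
    slot = interval_ms / len(client_ids)
    return {cid: i * slot for i, cid in enumerate(client_ids)}


def burst_times(offset_ms, interval_ms, horizon_ms):
    """Times (ms) at which the proxy releases a client's buffered packets."""
    times = []
    t = offset_ms
    while t < horizon_ms:
        times.append(t)
        t += interval_ms
    return times


offsets = stagger_offsets(["client1", "client2"], interval_ms=100)
schedule = {cid: burst_times(off, 100, 400) for cid, off in offsets.items()}
print(schedule)
```

With two clients and a 100 msec interval, the second client's bursts
fall halfway between those of the first, so neither client has to stay
awake while the other is being served, provided each burst fits within
its slot.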


First we tabulate the energy saving and the associated
data loss for a client-side history based policy (described
in [5]) in Table 2. Table 2 shows the inherent limitations
of client-side history based policies in a network
with high contention. Both Real and Quicktime streams
as well as high bandwidth Microsoft media streams experience
higher data loss compared to the single client
case shown in Table 1. Low bandwidth Microsoft media
streams are transmitted at fairly regular intervals which
are not affected much by the increased network contention.


5.2.1 Microsoft media streaming format 


We performed experiments with a local proxy servicing
two clients accessing the same stream of identical stream
quality. We plot the energy metric for the various Microsoft
media streams in Figure 10. We notice similar
results as in the single client case (Figure 7), showing
the potential for an application level contention reduction
mechanism for energy conserving policies. Note that our
WLAN only supports an effective throughput of about 4
Mbps. Using two 2 Mbps streams saturates the wireless
network; adding additional delays to the streams worsens
the contention interval, leading to the player adapting to
a lower bandwidth stream.


Figure 12: Energy metric for two simultaneous Quicktime
streams


5.2.2 Real streaming format 


We repeated the experiments for Real streams and plot
the results in Figure 11. For the most part, two clients
provide similar savings as the single client case. However,
Real players inexplicably experience fatal errors,
particularly when operating with a delay of 50 msec. We are
presently investigating why the Real players crash when
two players are simultaneously accessing two multimedia
streams.


5.2.3 Quicktime streaming format 


We continue with our analysis for the various Quicktime
streams and plot the results in Figure 12. From Figure
12, we notice similar performance gains for two clients
simultaneously accessing multimedia streams. Again,
we notice that Quicktime clients adapt to any introduced
delays by lowering the stream quality. A caching
client-side proxy can offset this client behavior.


In this section, we showed the effectiveness of an application
aware local proxy in shaping the multimedia traffic.
The system can not only allow the clients to better manage
their energy, but also schedule the packets to avoid
local WLAN network contention. The Quicktime multimedia
players adapt to any introduced delays by lowering
the stream quality. A caching client-side proxy can
help offset such client behavior.


6 Discussion 


In this paper, we show the limitations of IEEE 802.11 
power saving mode for receiving popular multimedia 
streams (Microsoft media, Real and Quicktime). We 
showed that the non-deterministic TIM interval and the 
associated Wait interval can adversely affect the poten- 
tial energy savings. We showed that the WNICs effec- 
tively switch out of the power saving modes for even 
moderately high bandwidth streams (128 kbps). Also,
a single TIM can reduce the energy savings for multiple
clients contending for the same TIM beacons by increasing
the wait intervals for each stream.


On the other hand, an application-specific server side 
traffic shaping mechanism can offer good energy saving 
for all the stream formats without any data loss. We use 
a simple server enhancement to transmit network pack- 
ets at predictable intervals. We show that we can reduce 
the energy metric (Joules/KB) by as much as 83%. We 
show how our approach can offer similar benefits for two 
clients sharing the same wireless access point. We note 
that the operating system scheduling mechanisms induce 
additional latencies that reduce further potential energy 
savings. Also, Quicktime players adapt to any network 
delays by lowering the stream quality. 


Our work makes the following contributions towards the
design of such application specific energy-aware network
traffic shaping mechanisms:


• We show that the amount of energy-saving delay
that can be introduced depends on the stream
requirements; lower fidelity streams are more
tolerant of longer delays.


• Operating system scheduling mechanisms can prevent
choosing very small values for these delays.


• These mechanisms have to take the stream format
into consideration. Formats such as Quicktime that
transmit a given media stream using at least two independent
channels (one each for audio and video)
should be treated properly so as to avoid conflicts
within the same application.


Additional information such as the associated access
point can also help avoid network media contention
issues.


• Future media players should provide a configurable
mechanism for specifying wireless networks such
that these energy saving transitions do not trigger
the network congestion response.


Atul Adya, Paramvir Bahl, Lili Qiu
Microsoft Research
1 Microsoft Way, Redmond, Washington 98052
{adya, bahl, liliq}@microsoft.com


Abstract 


There is a fair amount of evidence that suggests that Internet
access from wirelessly-connected mobile hand-held
devices is gaining popularity. However, there
haven’t been too many studies that have focused solely 
on analyzing the wireless Internet. In this paper, we 
study the notification and browse services provided by 
a large commercial web site designed specifically for 
users who access it via their cell-phones and PDAs. Un- 
like previous web studies that have analyzed browse ser- 
vices provided over wired networks, we focus primarily 
on browse and notification services provided over wire- 
less channels. Specifically, we analyze the notification 
and browser traces to understand the system load, the 
type of content accessed, and user behavior. We discuss 
the implications of our findings for techniques such as 
multicast, query caching and optimization, and transport 
protocol design. 


1 Introduction 


Over the last decade the cellular phone industry and 
the World Wide Web have experienced a phenomenal 
growth as people around the world have embraced these 
technologies at a remarkable rate. Today, most major 
wireless service providers in the United States, Europe, 
and Japan offer wireless Internet services and many In- 
ternet companies provide content that has been adapted 
to suit the limited display, bandwidth, memory, and pro- 
cessing power of small devices. 


Another emerging trend, related to wireless Internet, has 
to do with how users manage the gigantic information 
flow that the Internet provides. Realizing that users are 
being overwhelmed with information, several web con- 
tent providers allow users to switch their data access 
model from polling and navigation to notifications or 
alerts. Instead of periodically browsing through the web 
sites for potentially useful information, an increasing 
number of users are adopting the model where they reg- 
ister for information in which they are interested. These 
users provide a callback address usually in the form of 
an email address, a cell-phone number, or a pager num- 
ber, depending on their perceived importance of the in- 
formation. Whenever the relevant event is triggered, the 
content provider sends a notification to the user. Exam- 
ples of some US companies that provide such notifica- 
tions include Yahoo Mobile, MSN Mobile, AOL Any- 
where, and InfoSpace. All of these services allow users 
to subscribe to alerts for stock quotes, sports scores, lot- 
tery, horoscope, calendar events, etc. If alert services 
become a popular form of user interaction with the 
web, it will be critical for content providers and content 
management companies to handle these notifications ef- 
ficiently. Proper management of notifications involves 
understanding which types of notifications are popular, 
which types of devices are used by subscribers for re- 
ceiving notifications, the frequency of sending these no- 
tifications on a per user basis, etc. 


In this paper, we study notification and browse services 
provided by a large popular commercial web site that 
is designed specifically for US users who access it via 
their cell-phones and PDAs. Unlike most previous web 
studies, which have analyzed browsing services pro- 
vided over wired networks, we focus primarily on a 
web server that delivers notification and browsing ser- 
vices over wireless channels. We analyze notification 
and browser traces to understand the system load, the 
type of content that is accessed, and user behavior. We 
believe that our study is important for content providers, 
wireless ISPs, and web site managers. 


We note here that we do not study the performance of 
the web server subsystem or its architectural design. In- 
stead, we use web server logs to analyze the browse and 
notification patterns of wireless web users. 


The rest of this paper is organized as follows. In Sec- 
tion 2 we review previous work done in the field of web 
trace analysis. In Section 3, we describe the different 
ways in which the web site is accessed, the characteris- 
tics of the data logs, and the types of analyses we carry 
out. We present detailed analysis of the notification and 
browse logs in Sections 4 and 5, respectively. 
In Section 6, we examine the degree of correlation be- 
tween the usage of browse and notification services. We 
conclude in Section 7. 


2 Related Work 


There have been a number of studies on the access dy- 
namics of web servers servicing clients over a wired 
network. These studies include analyses of web ac- 
cess traces from the perspective of proxies [7, 20, 21], 
browsers [6, 9], and servers [4, 16]. However, to our 
knowledge, all previous web workload studies have been 
conducted for browse services only and there are no pub- 
lished studies on notification services. Consequently, we 
believe, our analysis of notification services is the first 
study of its kind. 


Even for the browsing services, most studies analyze 
web servers serving clients over wired networks. There 
are very limited studies on web servers serving clients 
over wireless channels. The study closest to ours is 
the one done by Kunz et al. [12], which analyzes net- 
work traces generated by a mobile browser application. 
Specifically, their paper analyzes user behavior (bytes 
transferred and time spent on the wireless link) based 
on the notion of a session that was chosen to be 90 sec- 
onds; however, a different session period could poten- 
tially change their results. The main limitation of their 
work is the size of the data analyzed: although the traces 
were collected over a period of seven months, only 80K 
entries were logged. It is unclear whether the inferences 
drawn from this study can scale up to large commercial 
sites. In contrast, we analyzed traces with millions of 
entries generated over a period of 12 days at a large com- 
mercial site. Furthermore, their study also has the limita- 
tion that it uses client IP addresses for identifying users; 
since IP addresses can be reassigned to different users, 
it is difficult to perform an accurate user-based analy- 
sis. In our study, since every entry in the logs contains 
a unique identifier for every access/notification, we are 
able to carry out user-behavior analysis more accurately. 
In addition, our study is broader as we focus on user 
behavior, server load, content, and document popularity 
analysis. 


Tang and Baker analyzed a seven-week trace of a 
metropolitan-area packet radio wireless network, and a 
twelve-week trace of a building-wide local-area wireless 
network [18, 19]. Both studies focus on how the net- 
works were used, e.g., when the networks were most ac- 
tive, how active the networks were, and how often users 
moved. They did not consider the content or ap- 
plications for which people used the wireless networks, 
which is the focus of our paper. 


Recently, Balachandran et al. [5] analyzed the user be- 
havior and network performance of an IEEE 802.11 
based wireless local area network (LAN) using a work- 
load captured at a three-day technical conference event. 
Their study focused on characterizing wireless LAN 
users for the purpose of coming up with a parameterized 
model to describe them. Additionally, they carried out 
workload analysis to address the network capacity plan- 
ning problem. Their study is very different from ours in 
terms of analysis, methodology and objectives. While 
we focus primarily on wireless browse and notification 
services, they consider all network traffic for improving 
the network performance. Furthermore, the data-set they 
captured and analyzed is smaller and significantly differ- 
ent from the web server traces we analyze. 


In the sections that follow, whenever appropriate, we re- 
fer to related work done by other researchers and com- 
pare it with our findings. 


3 Data Characteristics 


Before presenting the analysis, we briefly describe the 
different ways in which the web site is accessed, the 
characteristics of the data logs, and the types of analyses 
we carried out. 


For the web server we used in this study, a single browse 
request results in exactly one HTTP request to the server. 
There are no images or other types of content embedded 
in the page that is transmitted to the client as a result of 
this request. 


In the rest of the paper, we use the term notification doc- 
ument to refer to a unique document that may be sent to 
multiple users; we refer to each such transmission as a 
notification message, which includes duplicates. 


3.1 Types of Accesses 


For browsing, the web site is accessed in three differ- 
ent ways and we categorize the browse accesses based 
on this usage: desktop, offline, and wireless. Desktop 
accesses include requests from desktop and laptop ma- 
chines connected to the web site via wireline networks. 
Offline accesses are generated due to handheld devices 
such as PDAs. Companies such as Avantgo and Vindigo 
offer services that let users select content from different 
web sites and download it onto a handheld device for 
browsing at a later time. The content download occurs 
when a user synchronizes his/her handheld with a desk- 
top machine and is controlled by a “downloader” pro- 
gram; we refer to these programmatic accesses by the 
downloader as offline accesses. Wireless accesses occur 
due to browse actions initiated by users from their cell- 
phones or wireless devices. Typically, a request from a 
cell-phone is directed to a “gateway” (operated by the 
user’s service provider) that forwards the message to the 
web site; this gateway also forwards the reply back to 
the cell-phone. Thus, from the web site’s perspective, it 
just communicates directly with the gateway machines 
using the standard HTTP protocol. Since one gateway 
can serve multiple clients, we do not use IP addresses 
to identify users; instead, we use a unique identifier as- 
signed to every client that is logged with each access. 




















Browser Type    No. of accesses    No. of users 
Desktop               7,342,206         639,971 
Wireless              2,210,758          58,432 
Offline              20,508,272          50,968 
Misc                  2,944,708           1,634 


Table 1: User accesses according to browser types 


We determine the type of access based on the browser 
type stored in the log entry corresponding to that ac- 
cess. For example, entries with browser type “Mozilla 
Windows”, “Avantgo”, “UP.Browser” are categorized as 
desktop, offline and wireless accesses respectively. In 
Table 1 we show the number of accesses according to 
the browser type (in our case, each access corresponds 
to a single HTML page). The last category (Misc) cor- 
responds to log entries for which the browser type ei- 
ther was empty or contained characters that could not 
be mapped to any known browser client. The table also 
shows the number of unique users that were responsi- 
ble for different types of accesses. Note that the number of 
desktop users is much higher than the number of offline and 
wireless users because a large number of users use 
their desktop machines to register with the web site. 
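
The classification step described above can be sketched as follows. This is a minimal illustration, not the authors' actual script: only the three browser-type strings quoted in the text come from the paper, while the substring-matching rule, the names `RULES`, `classify_access` and `count_accesses`, and the fallback to a Misc category for anything unmatched are assumptions made for the example.

```python
from collections import Counter

# Hypothetical substring rules. The marker strings are the examples quoted in
# the text; matching by case-insensitive substring is an assumption.
RULES = [
    ("Avantgo", "offline"),
    ("UP.Browser", "wireless"),
    ("Mozilla", "desktop"),
]


def classify_access(browser_type):
    """Map a logged browser-type string to an access category."""
    text = (browser_type or "").strip().lower()
    for marker, category in RULES:
        if marker.lower() in text:
            return category
    # Empty or unrecognized browser types fall into the Misc category.
    return "misc"


def count_accesses(browser_types):
    """Count accesses per category for an iterable of browser-type strings."""
    return Counter(classify_access(b) for b in browser_types)
```

Counting unique users per category would follow the same pattern, collecting user ids into one set per category instead of incrementing a counter.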


In the case of notifications, there is a client type in the 
logs that tells us the type of the registered clients. More 
than 99% of the messages were sent to wireless clients; 
the remaining were sent to desktop clients. 


3.2 Description of Data Logs 


We had access to logs for 12 days of web browsing from 
August 15, 2000 through August 26, 2000. There were 
approximately 33 million entries in the browse logs. Ad- 
ditionally, we used notification logs from August 20, 
2000 through August 26, 2000, which contained 3.25 
million entries. For our analysis of the correlation be- 
tween browse and notification services (Section 6), we 
obtained additional notification logs and performed the 
comparison for the period from August 15, 2000 through 
August 26, 2000. 


When a registered user sends a browse request to the 
web server, a unique identifier corresponding to the user 
is sent to the server and logged in the web traces (for 
unregistered users, the id field is empty). We use these 
identifiers for performing the user-based analysis. Each 
log record also contains other pieces of useful informa- 
tion along with the user ids, such as the date, time, type 
of browser, the URL accessed, the data received and sent 
by the server, etc. 


When a notification message is sent, a record is logged 
in a database. We obtained a part of this database for 
our analysis. The database entries contained informa- 
tion about the server from which the notification mes- 
sage was sent, a user id, the type of device to which the 
message was sent (e.g., phone or pager), the type of alert, 
when it was sent, etc. 


To efficiently manipulate a large amount of data logs 
(over 10 GB), we consolidated them into a commercial 
database system and created indices on columns such 
as date, user id, and URL. To overcome the limited ex- 
pressiveness of our database language (in terms of string 
manipulation), we further processed the database output 
using Perl scripts. 


3.3 Types of Analyses 


We now discuss the types of analyses that we perform 
on the notification and browse logs, and the motivations 
for doing these analyses. 


1. Content analysis: We are interested in questions 
such as: (i) what are the most popular content cat- 
egories, and (ii) what is the distribution of message 
sizes? We believe such questions are important to 
(i) content providers who need to understand better 
how to prioritize and use the system and network 
resources efficiently, and to (ii) web site develop- 
ers who are interested in supporting fast access to 
popular content. 


2. Popularity analysis: We are interested in the pop- 
ularity distribution of notification and browse doc- 
uments. In particular, we are interested in compar- 
ing these accesses to the well-known Zipf-like dis- 
tribution as reported in previous web studies [4, 7, 
10, 14, 16], and in determining how concentrated 
the requests/transmissions are on popu- 
lar documents. This has significant implications for 
the effectiveness of web caching and multicast de- 
livery. 


3. User-behavior analysis: We are interested in clas- 
sifying users according to their access patterns. 
This is useful for personalization, targeted adver- 
tising, prioritizing, and capacity planning. Specif- 
ically, we look at the following aspects of user be- 
havior: 


• Spatial Locality: whether users in the same 
geographical region tend to receive/request 
similar notification and browsing content. 


• Temporal Stability: whether users are inter- 
ested in browsing similar documents over 
time. 


• User Load Distribution: how different users 
place load on the web site; for service 
providers, this distribution has implications 
on pricing. 


4 Notification Log Analysis 


Table 2 shows the overall statistics for the notification 
logs. In one week, the server sent out 3.25 million no- 
tification messages for a total of 295 megabytes. One 
fourth of the messages sent out were distinct, while the 
remaining messages had the same content but were sent to dif- 
ferent users. (In some cases, the same message is sent to 
a user multiple times, e.g., if a user has registered for in- 
formation to be delivered at specific times and the infor- 
mation has not changed during that period.) The signifi- 
cant amount of duplication in messages sent to different 
users suggests that sending notification via application- 
level multicast would be useful; Section 4.2 examines 
this issue in greater depth. There were 200,860 distinct 
users, of which 99.02% were wireless users. The notifi- 
cations were sent at the average rate of 323 messages per 
minute. The peak rate was much higher, approximately 
30 times as high as the average rate. 


4.1 Content Analysis 


We begin our analysis by looking at the content of the 
notifications sent to various users. 
Total messages                              3.25 million 
Total distinct messages                     884,272 
Total bytes transmitted                     295 MB 
Total bytes of unique messages transmitted  71.3 MB 
Peak message rate                           9502 msgs/min 


Table 2: Overall statistics for the notification logs for the 
period from Aug 20 through Aug 26, 2000. 


4.1.1 Popular Categories 


We classified the notifications into categories based on 
the subject field, which was recorded in the notifica- 
tion logs. We plotted the number of messages sent for 
each notification category in Figure 1, and the number 
of users who received the notification message for each 
category in Figure 2. 
Figure 1: The total number of notifications sent for each 
category. 


Figure 2: The total number of users who received notifi- 
cations for each category. 


As Figure 1 shows, email, weather, news, stock quotes, 
sports, and horoscopes are the most popular categories 
in terms of the total number of notification messages. 
In comparison, weather, email, horoscopes, news, and 
stock quotes are the most popular categories in terms 
of the total number of users (see Figure 2). As we had 
expected, email alerts were very popular. On the other 
hand, we had not expected weather-related notifications 
to be so popular. Intuitively, one might have expected 
stock quotes and news to be more popular, especially 
since users have to explicitly register for different noti- 
fication types (including weather), i.e., notifications are 
not being sent due to some default setting on the user- 
signup page. Another surprise was the low popularity 
of calendar alerts. For calendar alerts, it is possible 
that subscribers use handheld devices that are not con- 
nected to the wireless Internet, for example, PDAs with 
pre-installed software to handle scheduled meetings, an- 
niversaries, etc. 


Figure 3: Change of user interest between weekdays and 
weekends 


Next we analyzed how user interest changed during the 
course of a week. Figure 3 shows a comparison between 
the amount of notification data sent on a weekday versus 
a day on the weekend. As one would expect, there is a 
significant difference between the number of stock quote 
alerts sent during the weekday compared to those sent on 
the weekend. Similarly, there are fewer mail alerts on 
weekends; this is probably due to lower levels of work 
activity that occur on weekends relative to weekdays, re- 
sulting in fewer triggering events. For other categories 
(e.g., sports, weather, horoscopes), the number of notifi- 
cation messages does not vary significantly over week- 
ends and weekdays. We attribute these patterns to the 
fact that not many users personalize all aspects of their 
notification portfolio in a very fine-grained manner (for 
event types such as weather, the web site allows users to 
select the frequency and the time of delivery). 


4.1.2 Notification Message Size and Its Implications 


We find that notification messages are small. Specifi- 
cally, all messages contain less than 256 bytes. We show 
the message size distribution in Figure 4 to illustrate this 
point. Consequently, it is important for the delivery pro- 
tocol to handle small messages efficiently. For example, 
if the protocol creates a new TCP connection for every 
notification message, the overhead can be high. In par- 
ticular, the connection establishment may increase the 
user-perceived latency by a factor of 3 (i.e., from one 
half round-trip time to one and a half round-trip times). 
Assuming the average notification message size to be 
128 bytes, the connection setup and tear-down increases 
the bandwidth usage from 168 bytes per message to 448 
bytes per message (i.e., 7 additional packets: 3 pack- 
ets in the three-way handshake connection setup, and 4 
packets in the connection teardown). 
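
The bandwidth figures above can be reproduced with a short calculation. The sketch below assumes 40 bytes of TCP/IP header per packet (a 20-byte IP header plus a 20-byte TCP header, no options), which is what the 168-byte figure for a 128-byte payload implies; the function and constant names are illustrative only.

```python
# Assumed per-packet overhead: 20-byte IP header + 20-byte TCP header.
HEADER_BYTES = 40
HANDSHAKE_PACKETS = 3  # SYN, SYN-ACK, ACK
TEARDOWN_PACKETS = 4   # FIN and ACK in each direction


def bytes_per_message(payload_bytes, new_connection_per_message):
    """Bytes on the wire for one notification message."""
    data_packet = payload_bytes + HEADER_BYTES
    if not new_connection_per_message:
        return data_packet
    control_packets = HANDSHAKE_PACKETS + TEARDOWN_PACKETS
    return data_packet + control_packets * HEADER_BYTES


persistent = bytes_per_message(128, new_connection_per_message=False)
per_message = bytes_per_message(128, new_connection_per_message=True)
print(persistent, per_message)  # 168 448
```

Under these assumptions a connection per message costs roughly 2.7 times the bandwidth of sending the same message over an already open connection.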
Figure 4: Size distribution of notification messages (in- 
cluding duplicates). 


One suggestion for reducing the overhead of connection 
setup and teardown is to use persistent connections [13], 
i.e., reuse a TCP connection for multiple transfers. In 
our case, the servers sending the notification messages 
can maintain persistent connections with the gateways 
of the wireless ISPs and then send all messages on this 
connection. 


4.2 Message Popularity Analysis and Its Implications 


Several studies have found that web accesses follow a 
Zipf-like distribution: the number of requests to the $i$-th 
most popular object is proportional to $1/i^{\alpha}$ [3, 4, 6, 7, 10, 
14, 16]. The estimates of $\alpha$ range from 0.5 to 1 for web 
proxy logs [7, 10, 14], and from 1 to 2 for web 
server logs [4, 16]. It is interesting to examine whether 
notification messages exhibit a similar property. 


To do the above, we take the following approach: For 
each notification document, we count the number of 
notification messages (i.e., copies) that were sent on a 
given day. We plot the total number of transmissions 
of a document (i.e., notification messages) versus the 
popularity ranking of the document on a log-log scale. 
Figure 5 shows the plot for August 21, 2000. The 
plots for the other days are similar, and are omitted for 
brevity. If we ignore the first few notification documents 
and the flat tail in Figure 5 (as is done in the previous 
work [6, 7, 16]), we note that the curve fits a straight 
line reasonably well. We compute the values of $\alpha$ us- 
ing least-square fitting, after excluding the top 20 doc- 
uments and the flat tail (the latter set represents the no- 
tification documents that were sent only once or twice). 
The straight line on the log-log scale implies that the 
notification documents follow a Zipf-like distribution. 
We find that for our complete data-set the value of $\alpha$ 
varies from 1.137 to 1.267 (in Figure 5, the value of $\alpha$ 
is 1.146). These values are higher than the $\alpha$ in the web 
proxy logs [7, 10, 14], and lower than (but close to) the 
$\alpha$ observed for popular web server logs [16]. 
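
The fitting procedure described above can be sketched as an ordinary least-squares fit of log(count) against log(rank). The code below is an illustration under stated assumptions rather than the scripts used in the study: the function name `fit_zipf_alpha` and its parameters are hypothetical, and the flat tail is approximated by dropping documents sent fewer than three times.

```python
import math


def fit_zipf_alpha(counts, skip_top=20, min_count=3):
    """Estimate the Zipf exponent from per-document transmission counts.

    Ranks documents by count, drops the `skip_top` most popular ones and
    the flat tail (counts below `min_count`), then fits a straight line to
    the remaining points on a log-log scale. Returns the negated slope.
    """
    ranked = sorted(counts, reverse=True)
    points = [
        (math.log(rank), math.log(count))
        for rank, count in enumerate(ranked, start=1)
        if rank > skip_top and count >= min_count
    ]
    if len(points) < 2:
        raise ValueError("not enough points left to fit a line")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return -(sxy / sxx)
```

Feeding the function counts generated from an exact power law, for example `[1e6 / r ** 1.2 for r in range(1, 2001)]`, should recover an exponent very close to 1.2, which is a useful sanity check before applying it to real logs.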
Figure 5: Frequency of notification documents versus 
ranking in log-log scale (for August 21, 2000). 


Figure 6 shows the cumulative distribution of notifica- 
tion documents on August 21, 2000. The top 1% of 
notification documents (i.e., 1704) account for 54.24% 
of the total notification messages. In the logs for 
other days, the top 1% of notification documents ac- 
count for 54.15% - 63.66% of the total messages. Such 
a high concentration of messages containing popular 
documents suggests that using application-level multi- 
cast [8, 11, 17, 22] for popular documents would yield 
significant savings in both bandwidth and server load. 
Figure 6: Cumulative distribution of notification mes- 
sages to documents (for Aug 21, 2000). 
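
The concentration figures quoted above (the share of messages carried by the top 1% of documents) correspond to a single point on the cumulative distribution. A minimal sketch of that computation follows; the function name `top_share` and the choice to round the number of top documents up are assumptions for illustration.

```python
import math


def top_share(counts, top_fraction):
    """Fraction of all messages accounted for by the most popular documents.

    `counts` holds the number of messages sent per document, and
    `top_fraction` is the fraction of documents to include (0.01 for 1%).
    """
    ranked = sorted(counts, reverse=True)
    total = sum(ranked)
    if total == 0:
        raise ValueError("no messages to analyze")
    k = max(1, math.ceil(len(ranked) * top_fraction))
    return sum(ranked[:k]) / total
```

Evaluating the function over a range of fractions yields the full cumulative curve of the kind plotted in Figure 6.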


A possible optimization is to distribute a set of caches 
over the Internet to form an overlay multicast tree rooted 
at the notification server. When a notification message 
needs to be sent to multiple recipients simultaneously, 
it can be sent over the overlay tree and also stored at 
the caches that it traverses. These caches can help in 
offloading the retransmission work (say, due to a client 
coming online) from the server: when the same copy of 
notification needs to be sent at a later time, the caches 
closest to the receiver can forward the message. 


Note that even though the current notification traffic is 
not significant, as the popularity of notification services 
increases, bandwidth usage will become an important 
factor for scaling the notification system. Consequently, 
optimizations such as application-level multicast will 
become more important. 


We also observed that the concentration of notification 
messages to documents becomes less pronounced as the 
number of the documents considered increases. For ex- 
ample, the top 7.6% - 42.0% of the documents account 
for 80% of the total messages, and the top 45.1% - 
71.0% of notifications account for 90% of the total mes- 
sages. This implies that a large performance benefit can 
be obtained by multicasting only the most popular noti- 
fication documents. 


4.3 User Behavior Analysis 


We now study two aspects of user behavior: (i) the spa- 
tial locality of user interest, and (ii) the distribution of 
load that users place on the server. 


4.3.1 Spatial Locality 


Spatial locality of user interest is about determining 
whether people in the same geographical region tend to 
receive similar notification content. To carry out our 
analysis we take the following approach. We define a 
notification message to be locally shared if at least two 
users in the same cluster receive the notification. We 
compare the degree of sharing using geographical clus- 
tering and four random clusterings. In the geographical 
clustering case, clients in the same city are clustered to- 
gether. In the random clustering case, clients are clus- 
tered randomly with the cluster size being the same as 
in geographical clustering. We obtained the geographi- 
cal location of users using a registration database which 
contains zip code information for each user. The zip 
code information is not clean: some users supplied 
invalid zip codes, and we filter out all zip codes that are 
not 5 digits. 14% of the users supplied such invalid zip 
codes. In the remaining entries, it is still possible to 
have zip codes that do not match the actual user loca- 
tion, but the fraction is likely to be small. Furthermore, 
when computing the degree of local sharing, we exclude 
the cities to which fewer than 100 notification messages 
were sent over the course of the week. 
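
The comparison between geographical and random clustering can be sketched as follows. This is an illustrative reading of the method, not the authors' code: the mapping from zip code to city is assumed to be given, the names `valid_zip`, `shared_fraction` and `random_clustering` are hypothetical, and the degree of sharing is taken here to be the fraction of deliveries whose document reached at least two users in the same cluster.

```python
import random
import re
from collections import defaultdict


def valid_zip(zip_code):
    """Keep only zip codes that consist of exactly five digits."""
    return re.fullmatch(r"[0-9]{5}", zip_code or "") is not None


def shared_fraction(deliveries, cluster_of):
    """Fraction of deliveries that are locally shared.

    `deliveries` is an iterable of (user, document) pairs and `cluster_of`
    maps each user to a cluster label (for example a city). Repeated
    deliveries of the same document to the same user count once.
    """
    receivers = defaultdict(set)  # (cluster, document) -> users
    for user, doc in deliveries:
        if user in cluster_of:
            receivers[(cluster_of[user], doc)].add(user)
    total = sum(len(users) for users in receivers.values())
    shared = sum(len(users) for users in receivers.values() if len(users) >= 2)
    return shared / total if total else 0.0


def random_clustering(cluster_of, seed=0):
    """Reassign users to clusters at random, preserving cluster sizes."""
    users = list(cluster_of)
    labels = [cluster_of[user] for user in users]
    random.Random(seed).shuffle(labels)
    return dict(zip(users, labels))
```

Calling `shared_fraction` once with the geographical assignment and once with each output of `random_clustering` (using different seeds) gives the two sets of values being compared.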
As shown in Figure 7, clients residing in the same city 
have significantly more sharing in notification content 
compared to the clients picked at random. We also com- 
pared geographical clustering with three other random 
clusterings and observed similar results. The higher de- 
gree of sharing in notification messages for clients in 
the same geographical region indicates that localized 
services are popular for notification services. For ex- 
ample, people living in New York are interested in re- 
ceiving notification messages about weather or events 
in New York. The geographical locality in notification 
content implies that placing servers (i.e., either notifi- 
cation server replicas or servers in an overlay network 
that provide application-level multicast) close to popular 
geographical clusters can be useful in reducing network 
load. 
Figure 7: Comparison of local sharing between random 
clients and clients that are geographically close together. 


4.3.2 Load Distribution of Different Users 


On average, we observed that a user receives 2.3 notifi- 
cation messages containing a total of 0.2 KBytes per day, 
and 16.1 notification messages containing 1.4 KBytes of 
data per week. There is a significant variation in the 
clients’ usage: during the week that we studied, some 
clients received over 1000 messages (containing as much 
as 0.1 MB of data), while other clients received fewer 
than 10 messages containing as little as a few hundred 
bytes of data. 


Figures 8 and 9 show the total number of messages and 
the total number of bytes received by different users on a 
log-log scale, respectively. Both curves fit very well with 
a straight line (i.e., follow a Zipf-like distribution), except 
at the tail where there is a sudden drop. We compute the 
values of $\alpha$ using least-square fitting, after excluding the 
sharp drop at the tail. The value of $\alpha$ is 0.4437 when 
usage is defined as the number of messages; when usage 
is defined as the number of bytes, its value is 0.4567. 


[Plot: trace and least-square line fit. x-axis: User ID (sorted by the total number of notification messages), log scale from 1 to 1,000,000.] 


Figure 8: The total number of notification messages re- 
ceived by different users. 
Figure 9: The total number of notification bytes received 
by different users. 


To further study how usage is distributed across differ- 
ent clients, we plot the cumulative distribution of client 
usage in Figure 10. As the figure shows, the top 5% 
of the clients received 28% of the notification messages, 
and 25% of the notification bytes; the top 10% of the 
clients received 40% of the notification messages, and 
38% of the notification bytes. It is clear that a small 
fraction of users consume a significant fraction of the 
system and network resources. It is also interesting to 
note that the CDF curves are similar for the two differ- 
ent ways of defining usage. The similarity of the curves 
shows that each user receives a similar number of bytes 
per message. 
Figure 10: Cumulative distribution of different clients’ 
usage. 


The cumulative load imposed by all users (in terms of 
number of messages and the number of bytes sent by the 
servers) is shown in Figure 11. The figure shows that the 
number of messages and the number of bytes are fairly 
constant during weekdays but exceed the number sent 
during the weekend. This confirms what one would ex- 
pect, i.e., information alerts are more frequently gener- 
ated when people are working. 


Figure 11: Number of bytes and messages served by the 
notification servers during the days in the week 


4.4 Summary 


Our analysis shows that notification messages are small, 
popular documents account for a significant fraction of 
the messages, and there exists a high degree of sharing in 
geographical regions. System designers need to develop 
transport protocols that can send such messages in a reli- 
able, efficient and secure manner. For example, an over- 
lay network consisting of geographically placed caches 
along with application-level multicast can reduce the 
total network bandwidth requirements and server load. 
We also observed that there is a significant variation in 
clients’ usage of notification services. Service providers 
can design pricing plans according to the needs of the 
clients and also specialize content based on geographi- 
cal location. 


5 Browser Log Analysis 


In this section, we present our browser logs analysis. In 
our earlier work, we performed analyses on document 
content and popularity, distribution of user sessions, and 
system load [1]. For the sake of completeness, we first 
summarize the major findings of our previous analysis, 
and then study the temporal stability and spatial locality 
of user accesses, as well as the distribution of the load 
placed by different users on the web server. 


5.1 Summary of Previous Analysis 


In [1], we analyzed the browser log collected during 
the period from August 15, 2000 through August 26, 
2000. During this time the web server received 1.6 — 3.2 
million requests per day from 64,000 — 98,000 distinct 
clients. Below is a synopsis of our major findings: 
1. The distribution of document popularity does not 
closely follow Zipf-like distribution, where a doc- 
ument is defined as a unique URL or as a unique 
URL and parameter pair. The majority of requests 
are concentrated on a small number of documents. 
In particular, we found that 0.1% — 0.5% of the doc- 
uments (i.e., approximately 121 — 442) account for 
90% of the requests. 


2. More than 60% of the pages accessed at the web 
server are due to offline PDA users and less than 
7% of the accesses are due to wireless clients; the 
remaining accesses are due to desktop clients for 
registration and customization services. 


3. Our analysis for the distribution of reply sizes 
showed that most of the replies to wireless clients 
are less than 3 KBytes. For offline clients, most of 
the replies are less than 6 KBytes. The reply size 
distribution for the two types of clients is similar. 


4. Our user session analysis showed that users tend 
to have short sessions when interacting with the 
web site: 95% of the sessions were less than 3 
minutes. We empirically determined the session-activity
threshold to be somewhere between 30 and
45 seconds (i.e., if no request is received from a
client for such a duration, it implies that the old ses- 
sion has ended). 


5. Our category analysis showed that stock quotes, 
news, and yellow pages are the top categories accessed
by wireless clients. For offline clients, help
is the most popular category followed by news and 
stock quotes. 


6. We observed that the relative importance of differ- 
ent categories did not change between weekdays 
and weekends (except stock quotes and sports). 
However, the amount of data accessed over the 
weekend drops by approximately 45%. 
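The session heuristic in finding 4 can be illustrated with a short sketch. This is a minimal Python example, not the implementation used in the study; the function names and the 45-second default (the upper end of the 30 to 45 second range reported above) are assumptions made for illustration.

```python
from typing import List


def split_sessions(timestamps: List[float], gap: float = 45.0) -> List[List[float]]:
    """Group one client's request times (in seconds) into sessions.

    A new session starts whenever the silence since the previous
    request exceeds `gap` seconds (hypothetical default: 45 s).
    """
    sessions: List[List[float]] = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= gap:
            sessions[-1].append(t)   # still within the active session
        else:
            sessions.append([t])     # inactivity gap exceeded: new session
    return sessions


def session_lengths(sessions: List[List[float]]) -> List[float]:
    """Duration of each session, measured from its first to its last request."""
    return [s[-1] - s[0] for s in sessions]
```

Applying `split_sessions` to each client's request times and collecting `session_lengths` gives the kind of session-length distribution summarized in finding 4.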


These findings have the following performance implica- 
tions: 


1. The high concentration of requests to popular doc- 
uments in the browser log implies that caching the 
results of popular queries would be very effective 
in reducing the web server load. 


2. Since most replies sent to wireless and offline users 
are small (3 to 6 KB), the wireless web server should
be highly optimized for sending short replies; for example,
optimizing TCP slow start and re-start [15, 23] can
be useful in this environment.


3. Our heuristic, based on user session analysis, to
determine the session-inactivity period can be useful
to wireless service providers who want to reclaim
IP addresses. Our analysis showed that IP
addresses may be reclaimed more quickly than the


First we study the requests from all users, i.e., including
wireless, offline, and desktop users. Figure 12 (a) and
(b) plot the overlap between weekdays August 15 (Tuesday)
and August 21 (Monday) versus other days (i.e.,
both weekend days and weekdays). (In Figure 12 (a) and
(b), the curves with points are for pairs of weekdays, and
those without points are for a weekday and weekend.)


mize the data layout to improve the performance of these
queries. For example, workload-based techniques can 
be used to generate indices and materialized views auto- 
matically for a database [2]; these techniques are largely 
applicable if the database query workload is relatively 
stable (which is the case for our browser queries). 


Second, the overlap initially fluctuates with the increasing
number of documents picked, and then decreases
when the number of top documents picked is over 100.
The initial fluctuation is probably due to the fact that although
very popular documents tend to remain popular,
their relative ranking does change over time. However,
as we further increase the number of documents, we may
include some less popular documents. Since these documents
are less likely to remain popular than very popular
documents, the temporal overlap decreases. This
phenomenon was also observed in [16].


Third, the overlap between pairs of weekdays is gener- 
ally higher than the overlap between a weekend day and 
a weekday. The overlap between two weekend days is 
even higher. This is consistent with our intuition, and 
suggests that we should use past weekday workload to
predict future weekday workload, and likewise use past
weekend workload to predict future weekend workload. 
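The temporal overlap discussed above can be sketched in a few lines of Python. The definition used here, the fraction of one day's top-k documents that also appear in another day's top-k, is an assumption chosen for illustration, and the function names are hypothetical; the measure plotted in the figures may be defined differently.

```python
from collections import Counter
from typing import Iterable, Set


def top_k(requests: Iterable[str], k: int) -> Set[str]:
    """Return the k most frequently requested documents in one day's log."""
    return {doc for doc, _ in Counter(requests).most_common(k)}


def top_k_overlap(day_a: Iterable[str], day_b: Iterable[str], k: int) -> float:
    """Fraction of day_a's top-k documents that are also in day_b's top-k."""
    a, b = top_k(day_a, k), top_k(day_b, k)
    if not a:
        return 0.0
    return len(a & b) / len(a)
```

Evaluating `top_k_overlap` for increasing k on pairs of days reproduces the kind of curve described above: high overlap for the most popular documents, and lower overlap as less popular documents enter the top-k set.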


We also examine the requests coming from only the 
wireless users, and find the results are very similar. As 
before, the set of popular queries remains stable over 
time. The stability is especially high when we consider 
the most popular queries. In addition, there is a significant
difference between the access pattern on weekdays
versus that on weekends. 


5.2.2 Spatial locality 


In this section, we consider the following question: do 
people in the same geographical region tend to issue a 
similar set of queries? We employ the same approach
as is used in studying the spatial locality for notification 
services (described in Section 4.3.1). 


Figure 14 compares the fraction of documents that are 
shared within a geographical cluster and within four 
random clusters, when we consider requests from all 
the users (excluding users with invalid IDs). The figure
shows that the curve for the geographical clusters
overlaps with those for random clusters. This overlap
indicates that the degree of sharing between geographical
clustering and random clustering is comparable,
and the correlation between users’ interest in browsing
over wireless channels and their geographical location
is weak.


Figure 14: Local sharing between random sets of clients
and clients that are geographically close together.
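To make the comparison between geographical and random clusters concrete, the following Python sketch computes a sharing measure for a cluster of users. The measure (the fraction of distinct documents in a cluster that were requested by at least two of its users) is an assumption made for illustration; the precise definition is given in Section 4.3.1 and may differ. All function names are hypothetical.

```python
import random
from typing import Dict, List, Set


def shared_fraction(docs_by_user: Dict[str, Set[str]], users: List[str]) -> float:
    """Fraction of distinct documents in the cluster requested by 2+ users."""
    counts: Dict[str, int] = {}
    for u in users:
        for d in docs_by_user.get(u, set()):
            counts[d] = counts.get(d, 0) + 1
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c >= 2) / len(counts)


def random_cluster(all_users: List[str], size: int, seed: int) -> List[str]:
    """Pick `size` users uniformly at random as a baseline cluster."""
    return random.Random(seed).sample(all_users, size)
```

Comparing `shared_fraction` for the users of one city against several `random_cluster` baselines of the same size mirrors the comparison shown in Figure 14.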


A possible explanation for the weak correlation is that 
the popular browse content has global interest. In par- 
ticular, as mentioned in Section 5.1, 0.1% - 0.5% of the 
URL and parameter combinations (i.e., about 121 to 442
unique combinations) account for 90% of the requests. 
With such a high concentration of user interest on a few 
documents, even when clients are picked at random, they 
share many requests; therefore, the geographical local- 
ity becomes insignificant. A similar phenomenon has 
been observed in a study of a popular news server [16],
where the authors observed that the significance of do- 
main membership becomes diminished during a popular 
event. A major distinction between that study and ours 
is the way in which users are clustered: in that study, 
users are clustered based on their DNS names, whereas 
in our study we cluster users based on their geographical 
region, e.g. the city in which they reside. 


A natural question follows — why is there such a high 
concentration of interest in popular documents that even 
when clients are picked at random they share many 
documents? Examination of the most popular URLs 
and parameters shows that they include the front pages 
for email login, news, sports, weather, lottery, and the 
signup application, as well as some popular stock quote
queries. Intuitively, these queries are popular with all
users regardless of their physical locations.


The lack of geographical locality implies that the web 
server's content can be replicated without keeping in 
mind the geographical location of the clients. 


We performed the same spatial locality analysis on requests
issued only by wireless clients. Figure 15 summarizes
the results. With geographical clustering, wireless
clients have slightly more sharing of documents
than with random clustering; however, the distinction
between the two clusterings is much less significant than
the difference observed for notification documents. This
result suggests that using geographical locality of wireless
users as input for optimizing performance (or providing
content) will yield limited success.


Figure 15: Comparison of local sharing between random
sets of wireless clients and wireless clients that are geographically
close together.


5.2.3 Load distribution of different users


In this section, we study the distribution of loads placed
on the web server by different users. Our earlier analysis
[1] examined the difference in load distribution between
wireless users and offline users. Now we look at
the load distribution at a more fine-grained level: the
per-user level.


Figure 16 and Figure 17 show the total number of accesses
and the total amount of data requested by different
clients, respectively (users with invalid identifiers were
discarded). As the figures show, there is a significant
variation in the load placed by different users on the web
server: some users request several orders of magnitude
more documents/data than other users. The accesses
from only the wireless clients reveal a similar property.
Thus, service providers can consider designing different
pricing plans that cater to the widely varying needs of
different users.


Figure 16: Total number of accesses made by different
users.


Figure 17: Total amount of data received by different
users.


Figure 18 shows the inter-arrival time between requests
coming from the same user. The requests generated from
the offline users are much more bursty than those from
the wireless users: 97% of the requests from the offline
users have 1 second or less inter-arrival time, whereas
only 9% of the requests from the wireless users have
comparable inter-arrival time. We observe very bursty
traffic for offline PDA users because their requests are
generated by the downloader program rather than a human
being; these users also generate significantly more
requests than wireless users. If not handled appropriately,
such bursts can delay wireless users unnecessarily.
The web site designers can address this problem
in a number of ways. For example, they can provide
higher priority to wireless users or restrict the burst of
offline user requests to a few front-door servers (servers
that handle incoming HTTP requests). An orthogonal
efficiency issue that needs to be addressed is the synchronization
protocol for PDAs, i.e., instead of sending
a large number of small requests, the synchronization
protocol could batch all these requests into a single request
and reduce the server load and roundtrip latency.


Figure 18: CDF of inter-arrival time between consecutive
requests from the same user.
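The burstiness comparison can be reproduced from a request log with a small amount of code. The Python sketch below is illustrative only; the function names and the input layout (a mapping from user identifier to request times in seconds) are assumptions, not the format of the logs analyzed here.

```python
from typing import Dict, List


def inter_arrival_times(times_by_user: Dict[str, List[float]]) -> List[float]:
    """Gaps between consecutive requests of the same user, pooled over users."""
    gaps: List[float] = []
    for times in times_by_user.values():
        ordered = sorted(times)
        gaps.extend(b - a for a, b in zip(ordered, ordered[1:]))
    return gaps


def fraction_at_most(gaps: List[float], threshold: float) -> float:
    """Empirical CDF value: share of gaps no longer than `threshold` seconds."""
    if not gaps:
        return 0.0
    return sum(1 for g in gaps if g <= threshold) / len(gaps)
```

Calling `fraction_at_most(gaps, 1.0)` separately for offline and wireless users yields the two points of the CDF quoted above (97% versus 9% in the measured logs).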
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6 Correlation between notifications and 
browsing 


Having studied both the notification logs and the browse 
logs, it is useful to understand whether there is any corre- 
lation between the browsing and notification activities of 
users. We are interested in answering questions such as: 
(i) do users utilize one of the services significantly more
than the other, and (ii) does their interest in particular
content categories differ across the two services?
We use the notification and browser logs, both spanning 
from August 15, 2000 through August 26, 2000 for the 
following analysis. 


6.1 Correlation in the amount of usage 


Figure 19 shows the average number of notification messages
versus the number of browse requests, and the average
number of browse requests versus the number of
notification messages. There is little correlation between 
the two variables: the number of notification messages 
fluctuates widely with the number of browse requests; 
similarly, the number of browse requests also shows no 
obvious trend with respect to the number of notification 
messages. The correlation coefficient between these two 
variables is 0.265 when considering all users, and 0.125 
when considering only wireless users. The low correlation
coefficients imply that web site designers cannot
predict a user’s browsing activity based on his/her notification
activity, and vice versa.
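For reference, a correlation coefficient of this kind can be computed as follows. The sketch assumes the reported values are standard Pearson correlation coefficients over per-user (browse requests, notification messages) pairs, which the text does not state explicitly; the function name is hypothetical.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(vx * vy)
```

Here `xs` would hold each user's number of browse requests and `ys` the same user's number of notification messages; values near 0, such as those reported above, indicate little linear relationship.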


6.2 Correlation in popular content categories 


We now look at the question whether users are interested 
in a similar set of content categories across the two ser- 
vices. To answer this we take the following approach: 
first, we classify notification messages and browsing ac- 
cesses into different categories. (The details of catego- 
rizing notifications are described in Section 4.2, and the 
details of categorizing browse accesses are described in 
our earlier work [1].) Then for each individual user, we
pick the top N content categories in browsing and the top
N content categories in notification (if the next few categories
after the Nth category have the same frequency
of access as the Nth category, we include those categories
as well for the top N case).
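The tie-inclusive selection of top N categories described above can be sketched as follows. This is an illustrative Python example under the assumption that each user's accesses are available as a list of category labels; the function name is hypothetical.

```python
from collections import Counter
from typing import Iterable, Set


def top_n_with_ties(accesses: Iterable[str], n: int) -> Set[str]:
    """Top-n categories by access count, including any category that
    ties with the n-th one."""
    counts = Counter(accesses)
    if n <= 0 or not counts:
        return set()
    ranked = counts.most_common()                 # descending frequency
    cutoff = ranked[min(n, len(ranked)) - 1][1]   # frequency of the n-th category
    return {cat for cat, c in ranked if c >= cutoff}
```

Because ties are included, the returned set can contain more than `n` categories, and it contains fewer when the user accessed fewer than `n` distinct categories.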


Figure 20 shows the percentage of users who have at
least some overlap between their top N browse and notification
categories. The degree of overlap is much higher
when we consider wireless users only. For example, for
the top 3 categories, the percentage of overlapped users
is less than 10% when considering all the users, and
around 50% when considering only the wireless users.
On the other hand, even when considering wireless users
only, the percentage of overlapped users is never more than
65%.


Figure 19: Correlation between the number of browse
requests and notifications of wireless users.


Figure 20: Number of users who have overlap between
their top N browsing categories and top N notification
categories.


We now compare the extent of the overlap by varying
N from 1 to the total number of categories. The results
are shown in Figure 21. The figure shows the average
percentage of overlap between two categories, where the
average overlap is computed as follows:


\[
\mathit{overlap}_{high} =
\frac{\displaystyle\sum_{i \,\in\, \text{relevant users}}
      \frac{\text{categories overlapped for user}_i}{\min(N, \min(BC, NC))}}
     {\#\ \text{relevant users}}
\]


where BC denotes the number of browse categories,
NC denotes the number of notification categories, and
relevant users refers to those users that have at least one
browse record and one notification record in the respective
logs. We show the results for only the top 9 categories,
since the values beyond that are stable.


Figure 21: Correlation between the number of browse
requests and notifications of wireless users.


Essentially these ratios compute the percentage of overlap
for each individual user, and then take the average
of these percentages over all wireless users or all
users. Since not all users have at least N browsing
or notification categories, we compute overlap_high and
overlap_low, where the former computes the percentage
of overlap by using the minimum of BC and NC, and
the latter uses the maximum of BC and NC. The figure
shows that the amount of overlap is considerably
higher when considering only wireless users. For example,
for the top three categories, the overlap is less
than 7% when considering all users. In comparison, for
wireless users, the overlap_low and overlap_high values
are 21% and 36%, respectively. We also observe that the
effect of increasing N is small. Even when N is 8, the
percentage of overlap is less than 50% for wireless users.
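The two averages can be computed as in the following Python sketch. It rests on several assumptions that should be read as illustrative: each user's categories are given as a list ranked by access frequency with no duplicates, the top N set is simply the first N entries (ties are ignored here), BC and NC are taken per user, and overlap_low is assumed to differ from overlap_high only in replacing the inner minimum by a maximum. The function name is hypothetical.

```python
from typing import Dict, List, Tuple


def average_overlap(browse: Dict[str, List[str]],
                    notify: Dict[str, List[str]],
                    n: int) -> Tuple[float, float]:
    """Return (overlap_high, overlap_low) averaged over relevant users.

    A user is relevant if it has at least one browse category and at
    least one notification category.
    """
    relevant = [u for u in browse if browse[u] and notify.get(u)]
    if not relevant or n <= 0:
        return 0.0, 0.0
    high = low = 0.0
    for u in relevant:
        bc, nc = len(browse[u]), len(notify[u])
        overlapped = len(set(browse[u][:n]) & set(notify[u][:n]))
        high += overlapped / min(n, min(bc, nc))   # optimistic normalization
        low += overlapped / min(n, max(bc, nc))    # pessimistic normalization
    return high / len(relevant), low / len(relevant)
```

By construction overlap_low never exceeds overlap_high, which matches the ordering of the 21% and 36% values reported for wireless users.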


The above results indicate that wireless users have mod- 
erate correlation in the way they use browse and notifi- 
cation services. In comparison, the correlation is much 
lower when considering all users. This is because the 
most popular browsing categories for desktop users are
sign-up services, direction, and general help, whereas
notification is usually not used to deliver these types of
content. On the other hand, some wireless users are 
interested in both browsing and receiving notifications 
about emails, stock quotes, personalization, news and 
sports. However, the degree of correlation is limited, 
and service providers cannot solely rely on a user’s noti- 
fication profile to determine what content he/she may be 
interested in browsing. 


7 Conclusions 


Internet access via small handheld devices is expected to
increase tremendously in the next few years. In this paper,
we analyzed the access patterns of a large web site
designed primarily for wireless and handheld mobile devices.
The web site provides both browse and notification
services. To our knowledge, this is a first-of-a-kind
study that analyzes notification services. It is also the first
to analyze user behavior using a commercial web site.
We believe this is an important first step in the direction
of understanding the dynamics of wireless Internet services.